NPS Inventory and Monitoring Division Intro to R Training: January 31 – February 2, 2022

Prep for Training

Installing required software

The only thing you really need to do to prepare for R training is to install the latest versions of R and RStudio. We’ll talk about the difference between R and RStudio on the first day, but for now, just make sure they’re installed. Directions for installing R/RStudio are below. If you run into any problems, check the instructions in R for Data Science Sections 1.4.1 and 1.4.2.

NOTE: If you don’t have Administrative privileges on your computer, you will have to submit an IT HelpDesk Ticket (need to be on VPN or NPS network to access this link) to install R and RStudio, which could take a while. PLEASE PLAN AHEAD!

Even if you already have R or RStudio installed, please install the latest versions of both programs. R recently went through a major version change from 3.x to 4.x with some potential code-breaking changes. The latest versions are needed to ensure everyone’s code behaves the same way.

Install the latest version of R (4.1.2 as of January 2022) Assuming you’re on a PC, the installer for the latest version of R can be downloaded from The Comprehensive R Archive Network (CRAN). You’ll want to download the R installer by clicking on “Download R 4.1.2 for Windows”, or whatever the latest version number happens to be. After it downloads, open and run the installer with default settings. Note that there’s no need to pin R to your taskbar or add a shortcut to your desktop. For 99% of your needs, you’ll run R within RStudio, and won’t ever open R directly.


Install RStudio The installer for RStudio for PCs can be downloaded here. You’ll need to click on the large blue “DOWNLOAD RSTUDIO FOR WINDOWS” button to download the installer. After it downloads, open and run the installer with default settings. I like having RStudio pinned to my taskbar, so it’s easier to find/open, but it’s up to you and your preferences.


Optional Reading We know you’re all busy and have to prioritize how you spend your time. If you have any time before training starts, we highly recommend reading Chapter 2: R basics and workflows in STAT545, an online book based on a graduate-level statistics class that Dr. Jenny Bryan teaches. She’s a stats professor turned senior programmer at RStudio, and I really like how she approaches programming in R in this book.


Structure of training

The training will take place over 4 half days. Each day will run from 10-2 MST via MS Teams. At least one trainer will also hold office hours during the hour before and the hour after each session, in case there are questions that couldn’t be handled during training.

Each training session has three trainers assigned, two of whom will be the main instructors. The third trainer will manage the chat. For most of the training, one of the trainers will share their screen to walk through the website and then demo with live coding. It will help a lot if you have 2 monitors, so you can have the shared screen on one monitor and your own session of RStudio open on the other.

Four days is barely enough to scratch the surface on what you can do in R. Our goals with this training are to help you get beyond the initial learning curve that can be really challenging to climb on your own, and to expose you to some of the really useful things R can do for you in your work. The trainers put a lot of thought and time into designing this training. We did it because we enjoy coding in R and it has greatly increased our efficiency and productivity in our jobs. We hope you have a similar experience as you start out on this R learning curve. As you learn more, please share your skills and code with others to help us build a larger community of R users in IMD who can learn from each other.

Finally, to help us improve this training for future sessions, please leave feedback in the training feedback form.


Day 1: Intro to R

Intro to R & RStudio

Some Notes on Coding

Welcome to the class!

Learning (or re-learning) how to program can feel like an intimidating learning curve. Congrats on taking the first step! Before we delve into technical topics, let’s talk a little about coding in general.

It’s okay to make mistakes

Errors and broken code are not only part of the learning process, they’re part of the coding process. If you think your code is running perfectly on the first try, you probably didn’t test it thoroughly enough. It can be frustrating and discouraging when you can’t get your code to work the way you want it to, but getting stuck doesn’t make you a bad programmer. Try to approach broken code as a puzzle and a learning opportunity.

Use the resources available to you

As the saying goes,

Good coders write good code; great coders steal great code.

Google and Stack Overflow are great resources when you don’t know how to do something off the top of your head. Reach out to your colleagues too - there’s no reason to waste time writing code that someone else has already written! Just give credit where credit is due, and take some time to make sure you understand what the copied code does. And don’t hesitate to reach out to more experienced colleagues with questions.

About R

R is a programming language that was originally developed by statisticians for statistical computing and graphics. R is free and open source, which means you will never need a paid license to use it, and anyone who wants to can view the underlying source code and suggest fixes and improvements. Since its first official release in 1995, R has remained one of the leading programming languages for statistics and data visualization, and its capabilities continue to grow.

Every programming language has strengths and weaknesses. We like R because:

  • It is great for statistics and data exploration
  • It is good at producing publication-quality plots and tables
  • There are many well-maintained add-ons (called packages) that add useful features to R (this website was built using the rmarkdown package)
  • R is well-documented, with many learning resources available whether you are a beginner or an advanced user

For more information on the history of R, visit the R Project website.

About RStudio

RStudio is what’s called an integrated development environment (IDE). When you install R, it comes with a simple user interface that lets you write and execute code. Writing code in this interface is kind of like doing word processing in Notepad: it’s simple and straightforward, but soon you’ll start wishing for more features. This is where RStudio comes in. RStudio makes programming in R easier by color coding different types of code, autocompleting code, flagging mistakes (like spellcheck for code), and providing many useful tools with the push of a button or keystroke (e.g. viewing help info).

When you open RStudio, you typically see 4 panes:

RStudio

  • Source:
    This is where the code gets written. When you create a new script or open an existing one, it displays here. In the screenshot above, there’s a script called bat_data_wrangling.R open in the source pane. Note that if you haven’t yet opened or created a new script, you won’t see this pane until you do.
    The source pane will helpfully color-code your code to make it easier to read. It will also detect syntax errors (the coding equivalent of a grammar checker). These are flagged with a red “x” next to the line number and a squiggly line under the offending code. When you’re ready to run all or part of your script:

    • Highlight the line(s) of code you want to run
    • Either click the “Run” button (top right of the source pane) or press Ctrl+Enter. At this point, the code gets sent to the console (the bottom left pane). You’ll see your code appear in the console, and then you’ll see the output if there is any.
  • Console:
    This is where the code actually runs. When you first open RStudio, the console will tell you the version of R that you’re running (should be R 4.1.0 or greater). We talked above about how you can run a script at the console. You can also type code directly into the console and run it that way. That code won’t get saved to a file, but it’s a great way to experiment and test out lines of code before you add them to your script in the source pane.
    The console is also where errors appear if your code breaks. Deciphering errors can be a challenge sometimes, but Googling them is a good place to start.

  • Environment/History/Connections:

    • Environment: This is where you can see what is currently in your environment. Think of the environment as temporary storage for objects - things like datasets and stored values - that you are using in your script(s). You can also click on objects and view them.
      Remember that anything you see in your environment is temporary and it will disappear when you restart R. If there is something in your environment that you want to access in the future, make sure that your script is able to reproduce it (or save it to a file).
    • History: This shows the code you’ve run in the current session. You shouldn’t rely on it, but it can be a way to recover code you ran at the console and later realized you needed in your script.
    • Connections: This is one way to connect R to a database. We’ll talk about this more later on.
    • Git: If you have installed Git on your computer, you may see a Git tab. We won’t talk much about it this week, but this is where you’ll keep track of changes to your code. You are encouraged to attend the Git/GitHub session during Week 2 if you’d like to learn more!
    • Tutorial: This has some interactive tutorials that you can check out if you are interested.
  • Files/Plots/Packages/Help/Viewer:

    • Files: This tab shows the files within your working directory (typically the folder where your current code project lives). More on this later.
    • Plots: This tab will show plots that you create.
    • Packages: This tab allows you to install, load, and update packages, and also view the help files within each package. This is kind of like the R equivalent of an app store. You can also do this with code.
    • Help: Allows you to search for and view documentation for packages that are installed on your computer.
    • Viewer: Shows HTML outputs produced by R Markdown, R Shiny, and some plotting and mapping packages.
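Everything the Packages tab does can also be done with code, which is handy when you want your script to document which packages it needs. A minimal sketch (ggplot2 is just an example package name; the install line is commented out so it doesn’t re-run every time):

```r
# Install a package from CRAN (only needs to happen once per computer)
# install.packages("ggplot2")

# Load an installed package so its functions are available in this session
library(tools)

# See the names of the packages installed on your computer
head(rownames(installed.packages()))
```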

Global Options

There are several settings in the Global Options that everyone should check to make sure we all have consistent settings. Go to Tools -> Global Options and follow the steps below.

  1. Under the General tab, you should see that your R version is [64-bit] R-4.1.0 or greater. If it’s not, you probably need to update R. Let an instructor know if you have questions about this.
  2. Make sure you are not saving your environment. To do this, uncheck the Restore .RData into your workspace at startup option.
    When this option is checked, RStudio saves your current working environment (the stuff in the Environment tab) when you exit. The next time you open RStudio, it restores that same environment. This seems like a good thing, but the whole point of using R is that your code should return the same results every time you run it. Clearing your environment every time you close RStudio forces you to run your code with a clean slate so that you can be sure that it will work for everyone, not just you and your specific environment.
    If you sometimes work with huge datasets that take a long time to load and process, you may want to set Save workspace to .RData on exit to “Ask”. When you close RStudio, it will ask you if you want to save your workspace image. If you have a huge dataset (or the products of a huge dataset) in your environment, select “Yes” and then run load(".RData") at the console the next time you open RStudio to restore everything.

Most other settings are whatever you prefer, including everything in the Appearance window.

Starting a Project

When you’re working on a code project it’s important to stay organized. Keeping code, input data, and results together in one place will protect your sanity and the sanity of the person who inherits the project. RStudio projects help with this. Creating a new RStudio project for each new code project makes it easier to manage settings and file paths.

Before we create a project, take a look at the Console tab. Notice that at the top of the console there is a folder path. That path is your current working directory.

default working directory

If you refer to a file in R using a relative path, for example ./data/my_data_file.csv, R will look in your current working directory for a folder called data containing a file called my_data_file.csv. Using relative paths is a good idea because the full path is specific to your computer and likely won’t work on a different computer. But there’s no guarantee that everyone has the same default R working directory. This is where projects come in.
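You can check the working directory from code, too. getwd() and list.files() are useful for debugging “file not found” errors, because they show you exactly where R is looking:

```r
# Print the current working directory
getwd()

# List the files and folders R can "see" in the working directory
list.files()

# A relative path like "data/my_data_file.csv" is resolved against the
# working directory, i.e. file.path(getwd(), "data/my_data_file.csv")
```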

Let’s make a project for this class. Click File > New Project. In the window that appears, select New Directory, then select New Project. You will be prompted for a directory name. This is the name of your project folder. For this class, call it imd-r-training-intro. Next, you’ll select what folder to keep your project folder in. Documents/R is a good place to store all of your R projects but it’s up to you. When you are done, click on Create Project.

Create new project step 1 Create new project step 2 Create new project step 3

If you successfully started a project named imd-r-training-intro, you should see it listed at the very top right of your screen. As you start new projects, you’ll want to check that you’re working in the right one before you start coding. Take a look at the Console tab again. Notice that your current working directory is now your project folder. When you look in the Files tab of the bottom right pane, you’ll see that it also defaults to the project folder.

Let’s start coding

First we need to create a new R script file. Make sure you are working in the imd-r-training-intro project that you just created. Click on the New File icon in the top left corner. In the dropdown, select R Script. The source pane will appear with an untitled empty script. Go ahead and save it by clicking the Save icon (and make sure the Source on Save checkbox is deselected). Call it my_first_script.R.

The basics

We’ll start with something simple. Basic math in R is pretty straightforward, and the syntax is similar to using a graphing calculator. Try it out! You can use the examples below or come up with your own. Even if you’re using the examples, try to actually type the code instead of copy-pasting - you’ll learn to code faster that way.

To run a single line of code in your script, place your cursor anywhere in that line and press CTRL+ENTER (or click the Run button in the top right of the script pane). To run multiple lines of code, highlight the lines you want to run and hit CTRL+ENTER or click Run.

Text following a hashtag/pound sign (#) will be ignored - use this to leave comments about what your code is doing. Commenting your code is one of the best habits you can form, and you should start now! Comments are a gift to your future self and anyone else who tries to use your code.

# By using this hashtag/pound sign, you are telling R to ignore the text afterwards. This is useful for leaving annotation of notes or instructions for yourself, or someone else using your code

# try this line to generate some basic text and become familiar with where results will appear:
print("Hello, let's do some basic math. Results of operations will appear here")
## [1] "Hello, let's do some basic math. Results of operations will appear here"
# one plus one
1+1
## [1] 2
# two times three, divided by four
(2*3)/4
## [1] 1.5
# basic mathematical and trigonometric functions are fairly similar to what they would be in excel

# calculate the square root of 9
sqrt(9)
## [1] 3
# calculate the cube root of 8 (remember that x^(1/n) gives you the nth root of x)
8^(1/3)
## [1] 2
# get the cosine of 180 degrees - note that trig functions in R expect angles in radians
# also note that pi is a built-in constant in R
cos(pi)
## [1] -1
# calculate 5 to the tenth power
5^10
## [1] 9765625

Notice that when you run a line of code, the code and the result appear in the console. You can also type code directly into the console, but it won’t be saved anywhere. As you get more comfortable with R, it can be helpful to use the console as a “scratchpad” for experimenting and troubleshooting. For now, it’s best to err on the side of saving your code as a script so that you don’t accidentally lose useful work.

Variables and functions

Variables

Occasionally, it’s enough to just run a line of code and display the result in the console. But typically the code we write is more complex than adding one plus one, and when we execute a line of code, we want to be able to store the result and use it later in the script. This is where variables come in. Variables allow you to assign a value (whether that’s a number, a data table, a chunk of text, or any other type of data that R can handle) to a short, human-readable name. Anywhere you put a variable in your code, R will replace it with its value when your code runs.

R uses the <- symbol for variable assignment. If you’ve used other programming languages (or remember high school algebra), you may be tempted to use = instead. It will work, but there are subtle differences between <- and =, so you should get in the habit of using <-.

# the value of 12.098 is assigned to variable 'a'
a <- 12.098

# and the value 65.3475 is assigned to variable 'b'
b <- 65.3475

# we can now perform whatever mathematical operations we want using these two variables without having to repeatedly type out the actual numbers:

a*b
## [1] 790.5741
(a^b)/((b+a))
## [1] 7.305156e+68
sqrt((a^7)/(b*2))
## [1] 538.7261

In the code above, we assign the variables a and b once. We can then reuse them as often as we want. This is helpful because we save ourselves some typing, reduce the chances of making a typo somewhere, and if we need to change the value of a or b, we only have to do it in one place.

Also notice that when you assign variables, you can see them listed in your Environment tab (top right pane). Remember, everything you see in the environment is just in R’s temporary memory and won’t be saved when you close out of RStudio.
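If you do want an object to survive between sessions, save it to a file from your script. One simple base-R option is the saveRDS() and readRDS() pair (the file name below is arbitrary):

```r
a <- 12.098

# Write the object to a file in the working directory...
saveRDS(a, "a_saved_value.rds")

# ...and read it back later, e.g. in a future session
a_restored <- readRDS("a_saved_value.rds")
a_restored
## [1] 12.098
```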

All of the examples you’ve seen so far are fairly contrived for the sake of simplicity. Let’s take a look at some code that everyone here will make use of at some point: reading data from a CSV.

Functions

It’s hard to get very far in R without making use of functions. Think of a function as a black box that takes some kind of input (the argument(s)) from the user and outputs a result (the return value). Can you identify the functions in the above examples?

anatomy of a function

Using functions - reading data from CSV

The read.csv function is used to bring in a dataset from a CSV file, and it stores the data in one of the fundamental data structures in R: a data frame. read.csv() takes the file path to the CSV as input, and it outputs a data frame containing the data from the CSV.

We’ll get very familiar with data frames in this class, but for the moment just know that a data frame is a table of data with rows and columns. Data frames are typically organized with rows being records or observations (e.g. sampling locations, individual critters, etc.), and columns being variables that characterize those observations (e.g., species, size, date collected, x/y coordinates). Once you have read the data in, you can take a quick look at its structure by typing the name of the variable it’s stored in.

# read in the data from tree_growth.csv and assign it as a dataframe to the variable "tree_data"
tree_data <- read.csv("data/tree_growth.csv")

# display the 'tree_data' data frame we just created
tree_data
##               Species DBH_in Height_ft Age_yr
## 1   arboreas madeupis    4.0         6      3
## 2   arboreas madeupis    6.0        10      5
## 3   arboreas madeupis    8.0         5      6
## 4   arboreas madeupis   11.0         7      6
## 5   arboreas madeupis   12.0        16      8
## 6   arboreas madeupis   13.0        19      8
## 7   arboreas madeupis   12.0        24     11
## 8   arboreas madeupis   20.0        22     11
## 9   arboreas madeupis   20.0        32     12
## 10  arboreas madeupis   11.0        31     13
## 11  arboreas madeupis   27.0        35     15
## 12  arboreas madeupis   25.0        33     15
## 13  arboreas madeupis   30.0        40     17
## 14  arboreas madeupis   35.0        67     18
## 15  arboreas madeupis   29.0        60     21
## 16  arboreas madeupis   37.0        65     30
## 17  arboreas madeupis   31.0        78     32
## 18  arboreas madeupis   38.0        76     36
## 19  arboreas madeupis   45.0        91     37
## 20  arboreas madeupis   55.0        83     38
## 21  arboreas madeupis   60.0        88     40
## 22  arboreas madeupis   75.0        86     43
## 23  arboreas madeupis   76.0        89     51
## 24  arboreas madeupis   80.0        75     55
## 25  arboreas madeupis   99.0       100     61
## 26 planteus falsinius    9.0         6      2
## 27 planteus falsinius    9.3         9      4
## 28 planteus falsinius   10.1         8      5
## 29 planteus falsinius   10.3        14      4
## 30 planteus falsinius   10.7        12      6
## 31 planteus falsinius   10.8        10      8
## 32 planteus falsinius   11.0        10      7
## 33 planteus falsinius   11.3        11      7
## 34 planteus falsinius   11.4        13      9
## 35 planteus falsinius   11.6        13     12
## 36 planteus falsinius   11.6        15     15
## 37 planteus falsinius   11.9        23     17
## 38 planteus falsinius   12.3        16     18
## 39 planteus falsinius   12.4        21     15
## 40 planteus falsinius   12.6        23     17
## 41 planteus falsinius   12.6        18     21
## 42 planteus falsinius   12.8        25     20
## 43 planteus falsinius   15.0        35     30

Examining datasets

Compared to scrolling through a dataset in Excel, examining a dataset in R can feel unintuitive at first. Stick with it, though. Once you get used to exploring data in R, you’ll find that it’s a much more efficient way to learn about your dataset.

Let’s see what we can learn about the dataset as a whole. The most useful functions for this are:

  • head() Show only the first few lines of the dataframe in the console. This is useful for taking a quick glance at the column names and first rows of data to make sure everything was read in properly, without printing the whole dataset.
  • summary() Show a basic statistical summary of the dataset.
  • str() Show information about the structure of the dataset, including number of rows and columns, column data types, and the first few values in each column.
  • View() In a new tab, display a spreadsheet-like view of the full dataset with options to sort and filter. Note that you can’t change the data in this view - it is read-only.
head(tree_data)  # Show the first few lines of the dataframe. Defaults to showing the first 6 rows
##             Species DBH_in Height_ft Age_yr
## 1 arboreas madeupis      4         6      3
## 2 arboreas madeupis      6        10      5
## 3 arboreas madeupis      8         5      6
## 4 arboreas madeupis     11         7      6
## 5 arboreas madeupis     12        16      8
## 6 arboreas madeupis     13        19      8
head(tree_data, n = 10)  # Show the first 10 rows
##              Species DBH_in Height_ft Age_yr
## 1  arboreas madeupis      4         6      3
## 2  arboreas madeupis      6        10      5
## 3  arboreas madeupis      8         5      6
## 4  arboreas madeupis     11         7      6
## 5  arboreas madeupis     12        16      8
## 6  arboreas madeupis     13        19      8
## 7  arboreas madeupis     12        24     11
## 8  arboreas madeupis     20        22     11
## 9  arboreas madeupis     20        32     12
## 10 arboreas madeupis     11        31     13
summary(tree_data)  # View a basic statistical summary
##    Species              DBH_in        Height_ft          Age_yr     
##  Length:43          Min.   : 4.00   Min.   :  5.00   Min.   : 2.00  
##  Class :character   1st Qu.:11.00   1st Qu.: 12.50   1st Qu.: 7.50  
##  Mode  :character   Median :12.60   Median : 23.00   Median :15.00  
##                     Mean   :24.78   Mean   : 35.35   Mean   :18.81  
##                     3rd Qu.:30.50   3rd Qu.: 62.50   3rd Qu.:25.50  
##                     Max.   :99.00   Max.   :100.00   Max.   :61.00
str(tree_data)  # Get info about the structure of the data
## 'data.frame':    43 obs. of  4 variables:
##  $ Species  : chr  "arboreas madeupis" "arboreas madeupis" "arboreas madeupis" "arboreas madeupis" ...
##  $ DBH_in   : num  4 6 8 11 12 13 12 20 20 11 ...
##  $ Height_ft: int  6 10 5 7 16 19 24 22 32 31 ...
##  $ Age_yr   : int  3 5 6 6 8 8 11 11 12 13 ...
View(tree_data)  # Open a new tab with a spreadsheet view of the data

Data structures

We’re going to take a little detour into data structures at this point. It’ll all tie back in to our tree data.

The data frame we just examined is a type of data structure. A data structure is what it sounds like: it’s a structure that holds data in an organized way. There are multiple data structures in R, including vectors, lists, arrays, matrices, data frames, and tibbles (more on this unfortunately-named data structure later). Today we’ll focus on vectors and data frames.

Vectors

Vectors are the simplest data structure in R. You can think of a vector as one column of data in an Excel spreadsheet, where each element is one row in that column. Here are some examples of vectors:

digits <- 0:9  # Use x:y to create a sequence of integers starting at x and ending at y
digits
##  [1] 0 1 2 3 4 5 6 7 8 9
is_odd <- rep(c(FALSE, TRUE), 5)  # Use rep(x, n) to create a vector by repeating x n times 
is_odd
##  [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE
shoe_sizes <- c(7, 7.5, 8, 8.5, 9, 9.5, 10, 10.5, 11, 11.5)
shoe_sizes
##  [1]  7.0  7.5  8.0  8.5  9.0  9.5 10.0 10.5 11.0 11.5
favorite_birds <- c("greater roadrunner", "Costa's hummingbird", "kakapo")
favorite_birds
## [1] "greater roadrunner"  "Costa's hummingbird" "kakapo"

Note the use of c(). The c() function stands for combine, and it combines elements into a single vector. The c() function is a fairly universal way to combine multiple elements in R, and you’re going to see it over and over.
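One detail worth noticing: c() flattens its inputs. Combining two vectors gives you one longer vector, not a vector of vectors:

```r
first_half <- c(1, 2, 3)
second_half <- c(4, 5)

# Combining two vectors produces a single vector of length 5
all_together <- c(first_half, second_half)
all_together
## [1] 1 2 3 4 5
length(all_together)
## [1] 5
```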

Let’s play around with vectors a little more. We can use is.vector() to test whether something is a vector. We can get the length of a vector with length(). Note that single values in R are just vectors of length one.

is.vector(digits)  # Should be TRUE
## [1] TRUE
is.vector(favorite_birds)  # Should also be TRUE
## [1] TRUE
length(digits)  # Hopefully this is 10
## [1] 10
length(shoe_sizes)
## [1] 10
# Even single values in R are stored as vectors
length_one_chr <- "length one vector"
length_one_int <- 4
is.vector(length_one_chr)
## [1] TRUE
is.vector(length_one_int)
## [1] TRUE
length(length_one_chr)
## [1] 1
length(length_one_int)
## [1] 1

In the examples above, each vector contains a different type of data. digits contains integers, is_odd contains logical (true/false) values, favorite_birds contains text, and shoe_sizes contains decimal numbers. That’s because a given vector can only contain a single type of data. In R, there are four data types that we will typically encounter:

  • character Regular text, denoted with double or single quotation marks (e.g. "hello", "3", "R is my favorite programming language")
  • numeric Decimal numbers (e.g. 23, 3.1415)
  • integer Integers. If you want to explicitly denote a number as an integer in R, append L to it or use as.integer() (e.g. 5L, as.integer(30)).
  • logical True or false values (TRUE, FALSE). Note that TRUE and FALSE must be all-uppercase.

There are two more data types, complex and raw, but you are unlikely to encounter these so we won’t cover them here.
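Since a vector can only hold one data type, R silently converts (“coerces”) mixed elements to the most flexible type involved. This is a common source of surprises when reading in messy data, so it’s worth seeing once:

```r
# Mixing numbers with text: everything becomes character
mixed <- c(1, "two", 3)
mixed
## [1] "1"   "two" "3"
class(mixed)
## [1] "character"

# Mixing logicals with numbers: TRUE and FALSE become 1 and 0
c(TRUE, FALSE, 2)
## [1] 1 0 2
```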

You can use the class() function to get the data type of a vector:

class(favorite_birds)
## [1] "character"
class(shoe_sizes)
## [1] "numeric"
class(digits)
## [1] "integer"
class(is_odd)
## [1] "logical"

If you need to access a single element of a vector, you can use the syntax my_vector[x] where x is the element’s index (the number corresponding to its position in the vector). You can also use a vector of indices to extract multiple elements from the vector. Note that in R, indexing starts at 1 (i.e. my_vector[1] is the first element of my_vector). If you’ve coded in other languages, you may be used to indexing starting at 0.

second_favorite_bird <- favorite_birds[2]
second_favorite_bird
## [1] "Costa's hummingbird"
top_two_birds <- favorite_birds[c(1,2)]
top_two_birds
## [1] "greater roadrunner"  "Costa's hummingbird"

Logical vectors can also be used to subset a vector. The logical vector must be the same length as the vector you are subsetting.

odd_digits <- digits[is_odd]
odd_digits
## [1] 1 3 5 7 9
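In practice you’ll rarely type out a logical vector by hand. Comparison operators like >, <, and == return logical vectors, which makes subsetting by condition very compact:

```r
digits <- 0:9

# The comparison produces one TRUE/FALSE per element...
digits > 5
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
# ...which we can use directly to subset
digits[digits > 5]
## [1] 6 7 8 9
```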

Data frames

Let’s revisit our trees data frame. We’ve explored the data frame as a whole, but it’s often useful to look at one column at a time. To do this, we’ll use the $ syntax:

tree_data$Species
##  [1] "arboreas madeupis"  "arboreas madeupis"  "arboreas madeupis" 
##  [4] "arboreas madeupis"  "arboreas madeupis"  "arboreas madeupis" 
##  [7] "arboreas madeupis"  "arboreas madeupis"  "arboreas madeupis" 
## [10] "arboreas madeupis"  "arboreas madeupis"  "arboreas madeupis" 
## [13] "arboreas madeupis"  "arboreas madeupis"  "arboreas madeupis" 
## [16] "arboreas madeupis"  "arboreas madeupis"  "arboreas madeupis" 
## [19] "arboreas madeupis"  "arboreas madeupis"  "arboreas madeupis" 
## [22] "arboreas madeupis"  "arboreas madeupis"  "arboreas madeupis" 
## [25] "arboreas madeupis"  "planteus falsinius" "planteus falsinius"
## [28] "planteus falsinius" "planteus falsinius" "planteus falsinius"
## [31] "planteus falsinius" "planteus falsinius" "planteus falsinius"
## [34] "planteus falsinius" "planteus falsinius" "planteus falsinius"
## [37] "planteus falsinius" "planteus falsinius" "planteus falsinius"
## [40] "planteus falsinius" "planteus falsinius" "planteus falsinius"
## [43] "planteus falsinius"
tree_data$Age_yr
##  [1]  3  5  6  6  8  8 11 11 12 13 15 15 17 18 21 30 32 36 37 38 40 43 51 55 61
## [26]  2  4  5  4  6  8  7  7  9 12 15 17 18 15 17 21 20 30

You can also use square brackets [] to access data frame columns. You can specify columns by name or by index (integer indicating position of column). It’s almost always best to refer to columns by name when possible - it makes your code easier to read, and it prevents your code from breaking if columns get reordered.

The square bracket syntax allows you to select multiple columns at a time and to select a subset of rows.

tree_data[, "Species"]  # Same as tree_data$Species
##  [1] "arboreas madeupis"  "arboreas madeupis"  "arboreas madeupis" 
##  [4] "arboreas madeupis"  "arboreas madeupis"  "arboreas madeupis" 
##  [7] "arboreas madeupis"  "arboreas madeupis"  "arboreas madeupis" 
## [10] "arboreas madeupis"  "arboreas madeupis"  "arboreas madeupis" 
## [13] "arboreas madeupis"  "arboreas madeupis"  "arboreas madeupis" 
## [16] "arboreas madeupis"  "arboreas madeupis"  "arboreas madeupis" 
## [19] "arboreas madeupis"  "arboreas madeupis"  "arboreas madeupis" 
## [22] "arboreas madeupis"  "arboreas madeupis"  "arboreas madeupis" 
## [25] "arboreas madeupis"  "planteus falsinius" "planteus falsinius"
## [28] "planteus falsinius" "planteus falsinius" "planteus falsinius"
## [31] "planteus falsinius" "planteus falsinius" "planteus falsinius"
## [34] "planteus falsinius" "planteus falsinius" "planteus falsinius"
## [37] "planteus falsinius" "planteus falsinius" "planteus falsinius"
## [40] "planteus falsinius" "planteus falsinius" "planteus falsinius"
## [43] "planteus falsinius"
tree_data[, c("Species", "Age_yr")]  # Data frame with only Species and Age_yr columns
##               Species Age_yr
## 1   arboreas madeupis      3
## 2   arboreas madeupis      5
## 3   arboreas madeupis      6
## 4   arboreas madeupis      6
## 5   arboreas madeupis      8
## 6   arboreas madeupis      8
## 7   arboreas madeupis     11
## 8   arboreas madeupis     11
## 9   arboreas madeupis     12
## 10  arboreas madeupis     13
## 11  arboreas madeupis     15
## 12  arboreas madeupis     15
## 13  arboreas madeupis     17
## 14  arboreas madeupis     18
## 15  arboreas madeupis     21
## 16  arboreas madeupis     30
## 17  arboreas madeupis     32
## 18  arboreas madeupis     36
## 19  arboreas madeupis     37
## 20  arboreas madeupis     38
## 21  arboreas madeupis     40
## 22  arboreas madeupis     43
## 23  arboreas madeupis     51
## 24  arboreas madeupis     55
## 25  arboreas madeupis     61
## 26 planteus falsinius      2
## 27 planteus falsinius      4
## 28 planteus falsinius      5
## 29 planteus falsinius      4
## 30 planteus falsinius      6
## 31 planteus falsinius      8
## 32 planteus falsinius      7
## 33 planteus falsinius      7
## 34 planteus falsinius      9
## 35 planteus falsinius     12
## 36 planteus falsinius     15
## 37 planteus falsinius     17
## 38 planteus falsinius     18
## 39 planteus falsinius     15
## 40 planteus falsinius     17
## 41 planteus falsinius     21
## 42 planteus falsinius     20
## 43 planteus falsinius     30
tree_data[1:5, "Species"]  # First five elements of Species column
## [1] "arboreas madeupis" "arboreas madeupis" "arboreas madeupis"
## [4] "arboreas madeupis" "arboreas madeupis"

Now that we’ve learned about vectors, these data frame columns might look familiar. A data frame is just a collection of vectors of the same length, where each vector is a column.

is.vector(tree_data$Species)
## [1] TRUE
class(tree_data$Species)
## [1] "character"
is.vector(tree_data$Age_yr)
## [1] TRUE
class(tree_data$Age_yr)
## [1] "integer"
str(tree_data)  # Check out the abbreviations next to the column names: chr, num, and int. These are the data types of the columns.
## 'data.frame':    43 obs. of  4 variables:
##  $ Species  : chr  "arboreas madeupis" "arboreas madeupis" "arboreas madeupis" "arboreas madeupis" ...
##  $ DBH_in   : num  4 6 8 11 12 13 12 20 20 11 ...
##  $ Height_ft: int  6 10 5 7 16 19 24 22 32 31 ...
##  $ Age_yr   : int  3 5 6 6 8 8 11 11 12 13 ...

Working with columns

R is great at working with vectors because they are the fundamental data structure of the language. Since data frame columns are vectors, they are typically easy to operate on.

Basic statistical summary functions aren’t too different from what you’d do in Excel.

# to calculate the mean of the 'Age_yr' column in the original dataframe:
mean(tree_data$Age_yr)
## [1] 18.81395
# or to calculate the mean of the DBH vector we create below:
dbh <- tree_data$DBH_in
mean(dbh)
## [1] 24.78372
# likewise for the median, standard deviation, and minimum:
median(tree_data$Age_yr)
## [1] 15
sd(tree_data$Age_yr)
## [1] 15.02261
min(tree_data$Age_yr)
## [1] 2

Another useful way to summarize a column is to get the unique values it contains. This is typically most useful with character columns (e.g. species) but works with any data type.

unique(tree_data$Species)  # in this line, we are printing all of the unique species names in the dataset (in this case 2).
## [1] "arboreas madeupis"  "planteus falsinius"
species_names <- unique(tree_data$Species)  # assign unique species names to a variable
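Since unique() returns a vector, you can pass its result to other functions. For example, length() counts how many distinct species there are. A minimal sketch using a small stand-in data frame (the full tree_data isn't reproduced here):

```r
# small stand-in for tree_data (the real data frame has 43 rows)
tree_demo <- data.frame(Species = c("arboreas madeupis", "arboreas madeupis",
                                    "planteus falsinius"))
demo_species <- unique(tree_demo$Species)
length(demo_species)  # 2: the number of distinct species
```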

Summarizing data is useful but we also want to be able to modify it. So let’s say we want to convert the ‘Height_ft’ column in tree_data from feet to inches. We’ll start by assigning a conversion factor to the variable in_per_ft.

in_per_ft <- 12  # Since there are 12 inches in a foot

We can then use this conversion factor to convert the ‘Height_ft’ column to inches, and place this converted value in a new column we will create in the data frame. Notice below that if you assign a vector to a column that doesn’t exist yet, R will automatically create that column. We’ll use the head() function to verify that the Height_in column was created correctly.

# here we are specifying the new column on the left side of the '<-', and telling R what we want put into it on the right side of the '<-'.
tree_data$Height_in <- tree_data$Height_ft * in_per_ft

# we can now use the head() function to check our work and make sure the proper conversion was carried out:
head(tree_data)
##             Species DBH_in Height_ft Age_yr Height_in
## 1 arboreas madeupis      4         6      3        72
## 2 arboreas madeupis      6        10      5       120
## 3 arboreas madeupis      8         5      6        60
## 4 arboreas madeupis     11         7      6        84
## 5 arboreas madeupis     12        16      8       192
## 6 arboreas madeupis     13        19      8       228

Basic data visualization

It’s often easier to get a sense of our data by visualizing it, so let’s explore the basic tools for taking a first glance at our data.

Histogram

We’ll start with a histogram. The code below generates a basic histogram plot of a specific column in the dataframe (remember from earlier that this is denoted by the $ sign) using the hist() function. We’ll start by creating a histogram for the ‘Age_yr’ column.

hist(x = tree_data$Age_yr)

Looking at the histogram, we can see that the age of most trees in the dataset fall in the 0-20 year old age class.
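If the default plot is hard to read, hist() also accepts optional arguments such as breaks (suggested number of bins), main (title), and xlab (x-axis label). A sketch with a stand-in age vector, since these are standard base R arguments rather than anything specific to this dataset:

```r
ages <- c(3, 5, 6, 6, 8, 11, 15, 21, 30, 43, 61)  # stand-in for tree_data$Age_yr
hist(x = ages,
     breaks = 5,                          # suggested number of bins
     main = "Distribution of tree ages",  # plot title
     xlab = "Age (years)")                # x-axis label
```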

Scatter plot

The data frame also includes information on the DBH and height of the trees. What if we want to quickly take a look at how two of these variables relate to one another? To get a basic plot, you can use the plot() function. A simple call to plot() takes two arguments: x (a vector of x coordinates) and y (a vector of y coordinates). You can learn more about plot() by typing ?plot at the console.

# this should generate an ugly but effective scatterplot of tree height vs age. It would appear that older trees are taller!
plot(x = tree_data$Age_yr, y = tree_data$Height_ft)

# let's try the same, but with DBH as the dependent variable on the y axis:
# again, we should see a plot below showing that, even in a make-believe forest, older trees of both species tend to be thicker!
plot(x = tree_data$Age_yr, y = tree_data$DBH_in)

Getting Help


There are a number of options to get help with R. If you’re trying to figure out how to use a function, you can type ?function_name. For example ?plot will show the R documentation for that function in the Help panel.

?plot
?dplyr::filter

If you have a question, online resources like Stack Exchange and Stack Overflow are extremely helpful. Google searches are a great first step, and it helps to include “in R” in every search related to R code.

Don’t hesitate to reach out to colleagues for help as well! If you are stuck on something and the answers on Google are more confusing than helpful, don’t be afraid to ask a human. Every experienced R programmer was a beginner once, so chances are they’ve encountered the same problem as you at some point. There is an R-focused data science community of practice for I&M folks, which anyone working in R (regardless of experience!) is invited and encouraged to join. The Teams join code will be shared with course participants.

Day 2: Data Wrangling

Intro and Packages

Data Wrangling Intro

Goals of this Day

  1. Learn to load and understand how packages work
  2. Understand what data wrangling is and some ways to do it.
  3. Learn how to clean your imported data so that R understands what it is
  4. Get data in the format you need for analysis using the tidyverse.
  5. Combine data from different data.frames so that it is in the correct format

Disclaimer: While R provides lots of ways to check and clean up data, these tools are no replacement for sound data management that prevents data issues from happening in the first place.


Working with Packages: Installing a Package

What is a package? A package is a group of functions that someone has put together. Packages come with documentation that tells you what each function does.

First, install the package

  1. Go to packages tab in lower right quadrant
  2. Click on “install”
  3. Type name of package ‘janitor’
  4. Click “install”
  5. Code will pop up in your console
install.packages("janitor")
Calling a Package

To use a package, call it by using the function “library”

library('janitor')

This is called “loading a package” or “calling a package”.
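As an aside, you can also use a single function from an installed package without loading the whole package, by prefixing the function with the package name and ::. A quick sketch (this assumes janitor is installed; clean_names() is covered later in this lesson):

```r
# package::function() works even if you haven't run library(janitor)
tidy <- janitor::clean_names(data.frame(`Trail Name` = "Bayside",
                                        check.names = FALSE))
names(tidy)  # "trail_name"
```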


Using Package

Look at the vignette

  • Go to packages window of lower right quadrant
  • Click on ‘janitor’ name in list of packages
  • Vignette will show up in the ‘help’ tab

Take some time to click on a function from the vignette list and walk through how to get helpful info.

Did you get “error” text?

Text often pops up in the console after you run code. You must decide whether it means you have to change something or not.

library('janitor')
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
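The masking note above is a message, not an error: your code still ran. A true error stops the current command. A minimal sketch of the difference, using try() (a base R function) to catch the error so the script can continue:

```r
message("just a message - the script keeps running")  # informational text only
# log() of a character string is a real error; try() catches it for inspection
result <- try(log("a"), silent = TRUE)
class(result)  # "try-error": the call failed, but the script continued
```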


Challenge: Install a Package

Challenge: install + call the package ‘tidyverse’


Solution

Solution: install + call the package ‘tidyverse’

install.packages("tidyverse")
library('tidyverse')


Data Cleaning

Data wrangling is all the steps you have to take to prepare your data for analysis or plotting. Data munging and data cleaning are other terms used to describe this process. Material presented today can be found in Hadley Wickham’s book “R for Data Science”, which is free and online: https://r4ds.had.co.nz/wrangle-intro.html

If you’re starting from this tab of the lesson, you’ll need to load two packages before proceeding.

library('janitor')
library('tidyverse')

Major Components to Data Wrangling

  • Step 1: Import (covered yesterday)
  • Step 2+3: Tidy & Transform (get data into a useful format)


Column Names

Sometimes you get data where the column names are messy. For an example of data that needs tidying before exploration and analyses, we’ll use fake data. Copy + paste the code chunk below into your script, and run it in the console to generate a dataframe called “messy_data”.

Here, we’ll clarify that we are using the word “data.frame” or “dataframe” to describe the data. The word “tibble” also shows up. We’ll use these terms interchangeably.
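You can see why the terms are interchangeable: a tibble is still a data.frame underneath. A quick check (assumes the tibble package is installed; it also loads with library(tidyverse)):

```r
library(tibble)    # tibble() is also available via library(tidyverse)
tb <- tibble(x = 1:3)
class(tb)          # "tbl_df" "tbl" "data.frame"
is.data.frame(tb)  # TRUE - a tibble is a data.frame with extra behavior
```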

messy_data <- tibble(TrailName = c('Bayside', 'Coastal', 'Tidepools', 
                                   'Monument', 'Tidepools', NA),
                     Visitor_count = c(200, 300, 400, 500, 400, NA),
                     `empty row` = NA)

This data.frame captures the hypothetical number of visitors across four trails at a park. There are a few qualities that make it difficult to interpret. Seeing this data, what would you change to make it most consistent?

messy_data
## # A tibble: 6 x 3
##   TrailName Visitor_count `empty row`
##   <chr>             <dbl> <lgl>      
## 1 Bayside             200 NA         
## 2 Coastal             300 NA         
## 3 Tidepools           400 NA         
## 4 Monument            500 NA         
## 5 Tidepools           400 NA         
## 6 <NA>                 NA NA

There are four things I suggest fixing with these data:

  1. The column names are inconsistently formatted
  2. There is an empty row
  3. There is an empty column
  4. There is a duplicated row

The Janitor package has ways to fix all of these data situations and aid in data cleaning. Let’s proceed with changing the column names.

The clean_names() function in the janitor package will consistently format column names.

clean_data <- clean_names(messy_data)
clean_data
## # A tibble: 6 x 3
##   trail_name visitor_count empty_row
##   <chr>              <dbl> <lgl>    
## 1 Bayside              200 NA       
## 2 Coastal              300 NA       
## 3 Tidepools            400 NA       
## 4 Monument             500 NA       
## 5 Tidepools            400 NA       
## 6 <NA>                  NA NA

Empty Rows and Columns

Now, we’ll deal with the empty row and empty column. To remove empty rows or columns, we can use the remove_empty() function.

The arguments for the function are as follows:

remove_empty(
  # first argument is the data
  dat, 
  # the next specifies if you want to get rid of empty rows, columns, or both
  which = c("rows", "cols"), 
  # quiet = TRUE suppresses the message telling you which rows or columns were removed; quiet = FALSE prints it
  quiet = TRUE)

Let’s give it a try

Removing empty rows

clean_data2 <- remove_empty(clean_data, which = c('rows'), quiet = FALSE)
clean_data2
## # A tibble: 5 x 3
##   trail_name visitor_count empty_row
##   <chr>              <dbl> <lgl>    
## 1 Bayside              200 NA       
## 2 Coastal              300 NA       
## 3 Tidepools            400 NA       
## 4 Monument             500 NA       
## 5 Tidepools            400 NA

Removing empty rows and columns

clean_data2 <- remove_empty(clean_data, which = c('rows', 'cols'), quiet = TRUE)
clean_data2
## # A tibble: 5 x 2
##   trail_name visitor_count
##   <chr>              <dbl>
## 1 Bayside              200
## 2 Coastal              300
## 3 Tidepools            400
## 4 Monument             500
## 5 Tidepools            400

The Pipe - Sequences of Functions

What happens if you’d like to both clean_names() and remove_empty() on the same data?

Solution 1 - run as a sequence

clean_data <- clean_names(messy_data)
clean_data <- remove_empty(clean_data, which = c('rows', 'cols'), quiet = TRUE)
clean_data
## # A tibble: 5 x 2
##   trail_name visitor_count
##   <chr>              <dbl>
## 1 Bayside              200
## 2 Coastal              300
## 3 Tidepools            400
## 4 Monument             500
## 5 Tidepools            400

Solution 2 - use the pipe operator

Pipe operators (%>% or |>) can be read like the word “then”. For example, clean_names() %>% remove_empty() can be read as “clean names, then remove empty”. They allow for a cleaner workflow.

clean_data <- clean_names(messy_data) %>%
              remove_empty(which = c('rows', 'cols'), quiet = TRUE)
clean_data
## # A tibble: 5 x 2
##   trail_name visitor_count
##   <chr>              <dbl>
## 1 Bayside              200
## 2 Coastal              300
## 3 Tidepools            400
## 4 Monument             500
## 5 Tidepools            400
# note that replacing %>% with |> also works

Note that there is no data argument in the remove_empty() function. This is because, when using a pipe, the data argument is inherited from the prior function.
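In other words, the two calls below are equivalent; the pipe quietly supplies the data as the first argument. A sketch with a tiny made-up data frame, assuming janitor is loaded:

```r
library(janitor)
df <- data.frame(a = c(1, NA), b = c(2, NA))  # second row is entirely empty
out_nested <- remove_empty(df, which = "rows", quiet = TRUE)    # data passed directly
out_piped  <- df |> remove_empty(which = "rows", quiet = TRUE)  # data supplied by the pipe
identical(out_nested, out_piped)  # TRUE
```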


Removing Duplicate Entries

A final function we’ll discuss is distinct() from the dplyr package, which is part of the tidyverse. It removes duplicate rows from data.

clean_data <- clean_names(messy_data) %>%
              remove_empty(which = c('rows', 'cols'), quiet = TRUE) %>%
              distinct()
clean_data
## # A tibble: 4 x 2
##   trail_name visitor_count
##   <chr>              <dbl>
## 1 Bayside              200
## 2 Coastal              300
## 3 Tidepools            400
## 4 Monument             500

A word of caution - sometimes you may not want to remove duplicate rows, or may not want to remove empty rows, because they are important to the data. You know your data best. Use functions responsibly.
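If you'd rather inspect duplicates before deciding whether to drop them, janitor's get_dupes() returns only the duplicated rows plus a dupe_count column, without removing anything. A sketch with a tiny made-up data frame, assuming janitor is loaded:

```r
library(janitor)
trails <- data.frame(trail_name = c("Bayside", "Tidepools", "Tidepools"),
                     visitor_count = c(200, 400, 400))
dupes <- get_dupes(trails)  # returns duplicated rows and how often each appears
dupes
```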


Challenge

The code chunk below creates messy data, called “messy_data2”. Use functions from the Janitor package to tidy the data, and name the result “clean_data2”. See if you can use the pipe operator to streamline the script.

messy_data2 <- tibble(`Park Code` = c(NA, NA, 'CABR', 'CHIS', 'SAMO', 'SAMO'),
                     visitor_count = c(NA, NA, 400, 500, 400, 400),
                     `empty row` = NA)

Solution
clean_data2 <- clean_names(messy_data2) %>%
               remove_empty(which = c('rows', 'cols'), quiet = TRUE) %>%
               distinct()

Dplyr functions

Data Wrangling

In this section, we’ll introduce functions in the dplyr package. This package is one of the packages that make up the tidyverse, so you won’t have to install and call another package. The dplyr package is a group of functions that help to manipulate and summarize data. https://www.rdocumentation.org/packages/dplyr/versions/0.7.8

You can think of these functions like verbs that accomplish tasks, and are strung together by the pipe operator (%>% or |>).

We’ll cover the following functions:

  1. select() - selects or removes columns
  2. rename() - renames columns
  3. arrange() - arranges a data.frame by column values
  4. filter() - filters data for observations that meet specific criteria
  5. mutate() - adds new column(s) to a data.frame
  6. pull() %>% table() - isolates and summarizes by a column
  7. group_by() %>% summarize() - summary calculations by groups/factor levels and returns data for each group
  8. group_by() %>% mutate() - summary calculations by groups/factor levels and returns data for each observation

The data we’ll use for modules 3-5 is from the Tidy Tuesday community on GitHub. Each week on Tuesday, the community posts data for exploratory data analysis and visualization. You can add data from this page to your RStudio environment by copying + pasting the “Get the data!” code chunk into your script and running it in the console.

https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-09-17

The data we’ll use is in csv format, so we’ll need to call the readr package, as well as the tidyverse and janitor for this lesson (if you’ve already called tidyverse and janitor, don’t run those lines).

library('readr')
library('janitor')
library('tidyverse')

Next, we’ll load the data from the R for Data Science Tidy Tuesday GitHub page. There are three lines which will load three data.frames.

  1. park_visits - which contains data on visits to US National Parks
  2. state_pop - which contains information on populations of US states
  3. gas_price - which contains information on gas prices over space and time

For now, we’ll focus on the park_visits data.frame, and will circle back to the other two later. There are descriptions of each column for each data.frame on the Tidy Tuesday website linked above if you’d like more information.

park_visits <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/national_parks.csv")
state_pop <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/state_pop.csv")
gas_price <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/gas_price.csv")

Working with dplyr

Dplyr is a package that is part of the tidyverse, which is used with data.frames. It makes tasks like filtering and summarizing data easier. Many of the functions within the package are “verb-like” so they are easy to understand and remember.


dplyr - select() and rename()

select() - keeps the columns you indicate and drops the rest. Since select() puts the columns in the order you type them, you can also use select() to reorder columns.

rename() - keeps all the columns, but renames the ones you indicate

Since we won’t be using geographical information, other than state, in our module, we’ll select some columns from the park_visits data to use today.

Note that the name to the left of the arrow is the resulting data.frame, and the information to the right, connected by pipes, is the sequence of steps to get to the result. I normally name the result something different than the original data.frame so that the original data aren’t affected. However, since we won’t need the geographical information later on, I will name this one “park_visits”.

# let's see the column names in park_visits
colnames(park_visits)
##  [1] "year"              "gnis_id"           "geometry"         
##  [4] "metadata"          "number_of_records" "parkname"         
##  [7] "region"            "state"             "unit_code"        
## [10] "unit_name"         "unit_type"         "visitors"
# we can also look at the first few rows by using head()

head(park_visits)
## # A tibble: 6 x 12
##   year  gnis_id geometry     metadata number_of_records parkname    region state
##   <chr>   <dbl> <chr>        <chr>                <dbl> <chr>       <chr>  <chr>
## 1 1904  1163670 POLYGON      <NA>                     1 Crater Lake PW     OR   
## 2 1941  1531834 MULTIPOLYGON <NA>                     1 Lake Roose~ PW     WA   
## 3 1961  2055170 MULTIPOLYGON <NA>                     1 Lewis and ~ PW     WA   
## 4 1935  1530459 MULTIPOLYGON <NA>                     1 Olympic     PW     WA   
## 5 1982   277263 POLYGON      <NA>                     1 Santa Moni~ PW     CA   
## 6 1919   578853 MULTIPOLYGON <NA>                     1 <NA>        NE     ME   
## # ... with 4 more variables: unit_code <chr>, unit_name <chr>, unit_type <chr>,
## #   visitors <dbl>
# you can either select for the columns you want
park_visits1 <- park_visits %>% select(year, parkname, unit_code, state, visitors)

# or against the columns you don't want, by using the minus sign before the column name
park_visits2 <- park_visits %>% select(-gnis_id, -geometry, -metadata, -number_of_records, -region, -unit_name, -unit_type)

# you can check if you have the same result by using colnames()
colnames(park_visits1)
## [1] "year"      "parkname"  "unit_code" "state"     "visitors"
colnames(park_visits2)
## [1] "year"      "parkname"  "state"     "unit_code" "visitors"
# to clear up our environment, we'll get rid of park_visits1 and 2 using remove()
remove(park_visits1, park_visits2)

# since we won't need all of the columns in park_visits, we'll change that file (note that park_visits is on both the left and right sides of the pipe)
park_visits <- park_visits %>% select(year, parkname, unit_code, state, visitors)

Taking a look at the park_visits data, there are some column name inconsistencies. For example, some are in snake case (unit_code), but some are not (parkname). To rename a column, use rename(). The structure of this argument is as follows:

rename(dataset, new_name = old_name) OR dataset %>% rename(new_name = old_name)

# rename parkname column to make it snake case
park_visits <- park_visits %>% rename(park_name = parkname)

# we can check if we've done it correctly by looking at the column names
colnames(park_visits)
## [1] "year"      "park_name" "unit_code" "state"     "visitors"

dplyr- filter()

filter() - filters out rows from a data.frame based on values in columns. You can combine criteria using & (and), | (or), and ! (not)

If you’d like to get visitation data for just parks in the state of Washington (WA), you can use the filter function. The structure is as follows:

wa_park_visits <- filter(
                        # name of data.frame
                        park_visits, 
                        # condition
                        state == 'WA'
                        )

# head() shows the first few rows of data
head(wa_park_visits)

# you can use unique() to check you've done the filtering correctly
unique(wa_park_visits$state)

Now that you’ve got parks in the state of Washington, perhaps you’d like to know which parks in WA had over 5 million total visitors. This will require more intensive filtering. We’ll have to filter for parks where state == 'WA' AND year == 'Total' AND visitors > 5,000,000.

wa_park_visits <- park_visits %>%
                  # filter for parks in WA
                  filter(state == 'WA' &
                  # filter for total visitation
                  year == 'Total' & 
                  # filter for over 5 million visitors
                  visitors > 5000000)

# head() shows the first few rows of data
head(wa_park_visits)

You also might be wondering why we use "==" instead of "=". In R, the single equal sign "=" is used to assign something. For example, if you want to make some variable equal to a number, you could tell R "x = 5", and it would remember that x is equal to 5. To evaluate something as TRUE/FALSE, you use "==". If you type "5 == 3" into the console, it will return the answer "FALSE". The filtering argument uses this TRUE/FALSE evaluation - filtering for the rows where the stated condition - such as year == "Total" - is TRUE.
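A quick sketch of the difference between = and ==:

```r
x <- 5   # assignment: x now holds the value 5 (x = 5 does the same thing)
x == 5   # comparison: TRUE
x == 3   # comparison: FALSE
```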


dplyr- challenge 1

Time for the first challenge. Using the park_visits data, create a new data.frame containing total visitation for a state of your choice. Note that some “states” aren’t states. For example, VI refers to the Virgin Islands. How can you check your work?


dplyr- solution 1
pr_park_visits <- park_visits %>%
                  # filter for parks in Puerto Rico
                  filter(state == 'PR' &
                  # filter for total visitation
                  year == 'Total')

# head() shows the first few rows of data
head(pr_park_visits)
# San Juan National Historic Site (Puerto Rico) had > 55 million visitors!

# remove result from environment
remove(pr_park_visits)

dplyr - arrange()

arrange() reorders rows. You can think of this like “sort” in Excel. While this is generally not needed in R, it can be helpful for exploring your data, and when working with time series data and lagged values.

As an example, we’ll re-visit total park visitation in WA, sorting by the number of visitors.

wa_park_visits <- park_visits %>%
                  # filter for parks in WA
                  filter(state == 'WA' &
                  # filter for total visitation
                  year == 'Total' & 
                  # filter for over 5 million visitors
                  visitors > 5000000) %>%
                  # arrange by visitation - default is ascending order
                  arrange(visitors)

# look at result (df is short so head() not needed)
wa_park_visits
## # A tibble: 9 x 5
##   year  park_name       unit_code state  visitors
##   <chr> <chr>           <chr>     <chr>     <dbl>
## 1 Total Lewis and Clark LEWI      WA      9515314
## 2 Total <NA>            NEPE      WA      9566104
## 3 Total Ross Lake       ROLA      WA     11335924
## 4 Total North Cascades  NOCA      WA     14809679
## 5 Total Fort Vancouver  FOVA      WA     20238294
## 6 Total San Juan Island SAJH      WA     23066952
## 7 Total Lake Roosevelt  LARO      WA     62247752
## 8 Total Mount Rainier   MORA      WA     94488801
## 9 Total Olympic         OLYM      WA    160737962
# to arrange by descending order, we can use desc()
wa_park_visits2 <- wa_park_visits %>%
                  arrange(desc(visitors))

# look at result 
wa_park_visits2
## # A tibble: 9 x 5
##   year  park_name       unit_code state  visitors
##   <chr> <chr>           <chr>     <chr>     <dbl>
## 1 Total Olympic         OLYM      WA    160737962
## 2 Total Mount Rainier   MORA      WA     94488801
## 3 Total Lake Roosevelt  LARO      WA     62247752
## 4 Total San Juan Island SAJH      WA     23066952
## 5 Total Fort Vancouver  FOVA      WA     20238294
## 6 Total North Cascades  NOCA      WA     14809679
## 7 Total Ross Lake       ROLA      WA     11335924
## 8 Total <NA>            NEPE      WA      9566104
## 9 Total Lewis and Clark LEWI      WA      9515314
# remove one result
remove(wa_park_visits2)
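The time-series use mentioned above looks like this: once rows are in year order, dplyr's lag() can reference the previous row, e.g. to compute a year-over-year change. A sketch with made-up numbers (not the park_visits data), assuming dplyr is loaded:

```r
library(dplyr)
yearly <- data.frame(year = c(2019, 2017, 2018),
                     visitors = c(300, 100, 200))
yearly_change <- yearly %>%
  arrange(year) %>%                           # lag() assumes rows are in order
  mutate(change = visitors - lag(visitors))   # difference from the previous year
yearly_change  # change is NA for the first year (no previous row to compare to)
```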

dplyr - pull()

pull() takes data from a column and returns it in vector form

pull() %>% table() is a handy way to see a summary of categorical data (e.g. how many observations there are for each state in your data)

# two ways to get the number of observations per state

# option 1 - use pull() to isolate a column and table() to get # of observations
state_n <- park_visits %>%
           # pull the state column only
           pull(state) %>%
           # get # of observations by state in table output
           table()

# option 2 - use group_by() to isolate column(s) and tally() to get # of obs
state_n2 <- park_visits %>%
           # group by state
           group_by(state) %>%
           # get tally of observations by state in output
           tally()

# remove created objects
remove(state_n, state_n2)

dplyr - mutate()

mutate() makes new columns by doing calculations on old columns

As an example, let’s calculate the total visitation for parks with over 5 million total visitors in WA in millions, by dividing the number of total visitors by 1,000,000.

wa_park_visits_millions <- wa_park_visits %>%
                           mutate(visitors_mil = visitors/1000000)

wa_park_visits_millions
## # A tibble: 9 x 6
##   year  park_name       unit_code state  visitors visitors_mil
##   <chr> <chr>           <chr>     <chr>     <dbl>        <dbl>
## 1 Total Lewis and Clark LEWI      WA      9515314         9.52
## 2 Total <NA>            NEPE      WA      9566104         9.57
## 3 Total Ross Lake       ROLA      WA     11335924        11.3 
## 4 Total North Cascades  NOCA      WA     14809679        14.8 
## 5 Total Fort Vancouver  FOVA      WA     20238294        20.2 
## 6 Total San Juan Island SAJH      WA     23066952        23.1 
## 7 Total Lake Roosevelt  LARO      WA     62247752        62.2 
## 8 Total Mount Rainier   MORA      WA     94488801        94.5 
## 9 Total Olympic         OLYM      WA    160737962       161.

dplyr - group_by() and mutate()

group_by() indicates that data is in groups, so you can do calculations separately for each group

group_by() %>% mutate() will add a column and return the same number of rows as the original data

This is helpful when you want to get group summary information without sacrificing the original values. For example, we might want to know what percent of a park’s total visitation occurs in each year, and add this to the original data.

park_visits2 <- park_visits %>%
                # remove the rows with Totals per park
                filter(year != "Total")  %>% 
                # do calculations by park
                group_by(park_name) %>%
                # add visit_percent column
                # visits for each year divided by the sum of total visits for each park
                # na.rm = TRUE means that NA values are not included in calculations
                # round(2) rounds the result to 2 decimal places for easier reading
                mutate(visit_percent = (100*visitors/sum(visitors, na.rm = TRUE)) %>%
                         round(2)) 

# take a look at the result
head(park_visits2)
## # A tibble: 6 x 6
## # Groups:   park_name [6]
##   year  park_name              unit_code state visitors visit_percent
##   <chr> <chr>                  <chr>     <chr>    <dbl>         <dbl>
## 1 1904  Crater Lake            CRLA      OR        1500          0   
## 2 1941  Lake Roosevelt         LARO      WA           0          0   
## 3 1961  Lewis and Clark        LEWI      WA       69000          0.73
## 4 1935  Olympic                OLYM      WA        2200          0   
## 5 1982  Santa Monica Mountains SAMO      CA      468144          2.66
## 6 1919  <NA>                   ACAD      ME       64000          0
# remove from environment
remove(park_visits2)

dplyr - group_by() and summarize()

group_by() and summarize() (or summarise()) also work with grouped data. Here you calculate a summary for each group. The output data frame will have the columns that define the groups plus the summary columns, with one row for each group.

I find this combination most helpful for calculating means and standard deviations for data. You can also find minimum and maximum values. We’ll filter for total visitation numbers for each park, then calculate a visitation summary for each state.

state_summary <- park_visits %>%
                # filter for total visitation numbers (to avoid double counts)
                filter(year == 'Total') %>%
                # do calculations by state
                group_by(state) %>%
                # calculate summaries
                summarize(
                  # mean number of total visitors across parks in each state
                  mean_visit = mean(visitors, na.rm = TRUE),
                  # sd of total visitors across parks in each state
                  sd_visit = sd(visitors, na.rm = TRUE),
                  # min total visitors across parks in each state
                  min_visit = min(visitors, na.rm = TRUE),
                  # max total visitors across parks in each state
                  max_visit = max(visitors, na.rm = TRUE),
                  # get number of park units w data in each state
                  n_parks = n()
                )

# take a look at the result
head(state_summary)
## # A tibble: 6 x 6
##   state mean_visit   sd_visit min_visit max_visit n_parks
##   <chr>      <dbl>      <dbl>     <dbl>     <dbl>   <int>
## 1 AK      4661539    6737110.     10490  19476522      22
## 2 AL      2998848.   2162876.    365883   5487784       5
## 3 AR     20647567.  34237894.     60563  92028933       7
## 4 AS       113694         NA     113694    113694       1
## 5 AZ     23655713.  46310344.    348811 205486894      19
## 6 CA     61322519. 120602434.      6349 611031225      26
# summarize() drops one level of grouping; calling ungroup() makes sure the result
# is fully ungrouped before further calculations. For example:
state_summary <- state_summary %>%
                 ungroup() %>%
                 mutate(dif = max_visit - min_visit)

# remove from environment
remove(state_summary)

dplyr - challenge 2

For this second challenge, you’ll start with the park_visits data. Calculate the average visitation divided by the number of parks for each state. The final data.frame should be arranged in descending order by a numeric column of your choice.


dplyr - solution 2
challenge2 <- park_visits %>%
                # filter for total visitation numbers (to avoid double counts)
                filter(year == 'Total') %>%
                # do calculations by state
                group_by(state) %>%
                # calculate summaries
                summarize(
                  # mean number of total visitors across parks in each state
                  mean_visit = mean(visitors, na.rm = TRUE),
                  # get number of park units w data in each state
                  n_parks = n()
                ) %>%
                # ungroup to do another calculation
                ungroup() %>%
                # calculate visit/n
                mutate(visit_per_park = mean_visit/n_parks) %>%
                # select relevant columns
                select(state, visit_per_park) %>%
                # arrange in descending order
                arrange(desc(visit_per_park))
                

# take a look at the result
head(challenge2)
## # A tibble: 6 x 2
##   state visit_per_park
##   <chr>          <dbl>
## 1 NV        103867788.
## 2 PR         55928192 
## 3 ME         40890226.
## 4 NC         19591218.
## 5 MS         19160525 
## 6 OK         18206483.
# remove from environment
remove(challenge2)


Pivoting Tables


This section will go over the use of functions to change data from wide to long format and vice-versa.

We will:

  1. Explain what these formats are and why anyone cares
  2. pivot_longer()
  3. pivot_wider()

If you’re starting with this section, run the following code chunk to get caught up.

# call packages
library("readr")
library("janitor")
library("tidyverse")

# get data
park_visits <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/national_parks.csv")
state_pop <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/state_pop.csv")
gas_price <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/gas_price.csv")

# cut some columns from park_visits data
park_visits <- park_visits %>% select(year, parkname, unit_code, state, visitors)

Why do we care?

Depending on your goal, you may want to have your data formatted in multiple ways. For example, it is much easier for humans to read wide format data, while most R functions rely on long format data. Here is a quick demonstration using park visits data:

# we'll first take a subset of the park visits data
medn_visits <- park_visits %>%
  # get parks in MEDN and no totals, years 2010-2015
  filter(unit_code %in% c("CABR", "CHIS", "SAMO") & year != "Total") %>%
  # make year a number, since no more text
  mutate(year = as.numeric(year)) %>%
  # get years 2010:2015
  filter(year >= 2010 & year <= 2015) %>%
  # arrange in ascending order
  arrange(year) %>%
  # select relevant columns
  select(unit_code, year, visitors)

# if we take a look at the data, it is in long format, each observation is a row
# this isn't the most human-friendly way to look at data, but it is really helpful if we want to use dplyr group_by() functions like mutate() and summarize()
medn_visits
## # A tibble: 18 x 3
##    unit_code  year visitors
##    <chr>     <dbl>    <dbl>
##  1 SAMO       2010   568371
##  2 CABR       2010   763140
##  3 CHIS       2010   277515
##  4 SAMO       2011   609636
##  5 CABR       2011   813351
##  6 CHIS       2011   242756
##  7 SAMO       2012   649471
##  8 CABR       2012   877951
##  9 CHIS       2012   249594
## 10 SAMO       2013   633054
## 11 CABR       2013   856128
## 12 CHIS       2013   212029
## 13 SAMO       2014   694714
## 14 CABR       2014   893434
## 15 CHIS       2014   342161
## 16 SAMO       2015   797126
## 17 CABR       2015   981825
## 18 CHIS       2015   324816
# conversely, we can pivot wider to make the data wide format
# it is more human-friendly but you can't use dplyr functions on it easily
medn_visits_wide <- pivot_wider(medn_visits, names_from = year, values_from = visitors)
medn_visits_wide
## # A tibble: 3 x 7
##   unit_code `2010` `2011` `2012` `2013` `2014` `2015`
##   <chr>      <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 SAMO      568371 609636 649471 633054 694714 797126
## 2 CABR      763140 813351 877951 856128 893434 981825
## 3 CHIS      277515 242756 249594 212029 342161 324816

There are two functions we’ll discuss today:

  1. pivot_wider() : converts from long –> wide
  2. pivot_longer() : converts from wide –> long

pivot_wider()

Full disclosure, I almost always have to google the arguments that go into the pivot_ functions. Here’s the website for pivot_wider() https://tidyr.tidyverse.org/reference/pivot_wider.html

Here’s a short example - pivoting the MEDN visits from long to wide, with more detail than the prior code chunk.

# check out the original data (medn_visits) in long format
medn_visits
## # A tibble: 18 x 3
##    unit_code  year visitors
##    <chr>     <dbl>    <dbl>
##  1 SAMO       2010   568371
##  2 CABR       2010   763140
##  3 CHIS       2010   277515
##  4 SAMO       2011   609636
##  5 CABR       2011   813351
##  6 CHIS       2011   242756
##  7 SAMO       2012   649471
##  8 CABR       2012   877951
##  9 CHIS       2012   249594
## 10 SAMO       2013   633054
## 11 CABR       2013   856128
## 12 CHIS       2013   212029
## 13 SAMO       2014   694714
## 14 CABR       2014   893434
## 15 CHIS       2014   342161
## 16 SAMO       2015   797126
## 17 CABR       2015   981825
## 18 CHIS       2015   324816
# if we want to make each row a park and each column a year, we can pivot wider
medn_visits_wide <- medn_visits %>%
  pivot_wider(
    # names from is the thing you want to be the columns
    names_from = year,
    # values from is what you want to fill the columns with
    values_from = visitors
  )

# check out the result
medn_visits_wide
## # A tibble: 3 x 7
##   unit_code `2010` `2011` `2012` `2013` `2014` `2015`
##   <chr>      <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 SAMO      568371 609636 649471 633054 694714 797126
## 2 CABR      763140 813351 877951 856128 893434 981825
## 3 CHIS      277515 242756 249594 212029 342161 324816
# we can also make park units the columns and fill it with visitor data for each year
medn_visits_wide <- medn_visits %>%
  pivot_wider(names_from = unit_code, values_from = visitors)
medn_visits_wide
## # A tibble: 6 x 4
##    year   SAMO   CABR   CHIS
##   <dbl>  <dbl>  <dbl>  <dbl>
## 1  2010 568371 763140 277515
## 2  2011 609636 813351 242756
## 3  2012 649471 877951 249594
## 4  2013 633054 856128 212029
## 5  2014 694714 893434 342161
## 6  2015 797126 981825 324816

pivot_longer()

Now that we have our data in wide format, it is time to revert back to pivoting longer. Here’s the reference material for pivot_longer() https://tidyr.tidyverse.org/reference/pivot_longer.html.

# check out the original data (medn_visits_wide) in wide format
medn_visits_wide
## # A tibble: 6 x 4
##    year   SAMO   CABR   CHIS
##   <dbl>  <dbl>  <dbl>  <dbl>
## 1  2010 568371 763140 277515
## 2  2011 609636 813351 242756
## 3  2012 649471 877951 249594
## 4  2013 633054 856128 212029
## 5  2014 694714 893434 342161
## 6  2015 797126 981825 324816
# if we want to make each row an observation, with an associated park and year, we'll pivot_longer
medn_visits_long <- medn_visits_wide %>%
  pivot_longer(
    # first argument is the columns you'd like to transform
    SAMO:CHIS,
    # next is what you'd like to name the former name cols
    names_to = "unit_code",
    # last is what you'd like to name the former values cols
    values_to = "visitors"
  )

# check out the result
medn_visits_long
## # A tibble: 18 x 3
##     year unit_code visitors
##    <dbl> <chr>        <dbl>
##  1  2010 SAMO        568371
##  2  2010 CABR        763140
##  3  2010 CHIS        277515
##  4  2011 SAMO        609636
##  5  2011 CABR        813351
##  6  2011 CHIS        242756
##  7  2012 SAMO        649471
##  8  2012 CABR        877951
##  9  2012 CHIS        249594
## 10  2013 SAMO        633054
## 11  2013 CABR        856128
## 12  2013 CHIS        212029
## 13  2014 SAMO        694714
## 14  2014 CABR        893434
## 15  2014 CHIS        342161
## 16  2015 SAMO        797126
## 17  2015 CABR        981825
## 18  2015 CHIS        324816
# this should look familiar - like the original medn_visits data

pivot challenge

This is a more involved challenge that will draw on your knowledge of both dplyr (from Module 3) and pivoting.

Using the state_pop data, accomplish the following tasks:

  1. Get data for 3 states of your choice for the years 2010-2015 (note: there are no 'Total' values in the year column)
  2. Create a data.frame in long format
  3. Pivot wider to make a human-friendly table with states as rows and years as columns
  4. Pivot longer, returning to the original layout

Hint: if your column names are numbers, like years, you must refer to them using back-ticks. A column called 2020 would be `2020`
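The backtick rule isn't tidyverse-specific: in base R, a column name that starts with a digit is "non-syntactic" and must be wrapped in backticks anywhere you refer to it. A minimal sketch with toy values (the `wide` data.frame here is made up for illustration):

```r
# a one-row table whose column name is a year (a non-syntactic name);
# check.names = FALSE keeps R from renaming "2020" to "X2020"
wide <- data.frame(state = "CA", `2020` = 39.5, check.names = FALSE)

# without backticks, 2020 would be parsed as a number, not a column name
wide$`2020`
```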


pivot solution
# get 3 states of choice (FL, CA, AK) for years 2010-2015
solution_original <- state_pop %>%
  filter(state %in% c("FL", "AK", "CA") &
    year >= 2010 & year <= 2015)
# pivot wider
solution_wide <- solution_original %>%
  pivot_wider(
    names_from = "year",
    values_from = "pop"
  )

# pivot longer to return to format of solution_original
solution_long <- solution_wide %>%
  pivot_longer(`2010`:`2015`, names_to = "year", values_to = "pop")



Joining Tables


Oftentimes we want to use data that is split into two or more data.frames. This section will go over the use of functions to “join” or combine data from two tables into one.

We will:

  1. Use bind_rows() and bind_cols() for simple cases

  2. Use dplyr commands for more complicated joins

     1. left_join() and right_join()
     2. full_join()
     3. inner_join()

  3. Use filtering joins to filter data in one data.frame based on what is in another data.frame

     1. semi_join()
     2. anti_join()

If you’re starting with this section, run the following code chunk to get caught up.

# call packages
library("readr")
library("janitor")
library("tidyverse")

# get data
park_visits <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/national_parks.csv")
state_pop <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/state_pop.csv")
gas_price <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/gas_price.csv")

# cut some columns from park_visits data and make sure year is numeric data only
park_visits <- park_visits %>%
  select(year, parkname, unit_code, state, visitors) %>%
  filter(year != "Total") %>%
  mutate(year = as.integer(year))


Binding Rows and Columns

bind_rows() and bind_cols() simply merge two data.frames directly.

  1. bind_rows()

bind_rows() adds the rows in the second data.frame to the bottom of the first data.frame. This generally only makes sense if they have many identical columns. Make sure that columns with the same data have the same name!

# we'll first make two subsets of the park_visits data
SHEN_visits <- park_visits %>%
  # get data from Shenandoah.
  filter(unit_code == "SHEN")

SHEN_visits
## # A tibble: 81 x 5
##     year parkname unit_code state visitors
##    <int> <chr>    <chr>     <chr>    <dbl>
##  1  1936 <NA>     SHEN      VA      694098
##  2  2016 <NA>     SHEN      VA     1437341
##  3  2015 <NA>     SHEN      VA     1321873
##  4  2014 <NA>     SHEN      VA     1255321
##  5  2013 <NA>     SHEN      VA     1136505
##  6  2012 <NA>     SHEN      VA     1210200
##  7  2011 <NA>     SHEN      VA     1209883
##  8  2010 <NA>     SHEN      VA     1253386
##  9  2009 <NA>     SHEN      VA     1120981
## 10  2008 <NA>     SHEN      VA     1075878
## # ... with 71 more rows
ACAD_visits <- park_visits %>%
  # get data from Acadia.
  filter(unit_code == "ACAD")

ACAD_visits
## # A tibble: 98 x 5
##     year parkname unit_code state visitors
##    <int> <chr>    <chr>     <chr>    <dbl>
##  1  1919 <NA>     ACAD      ME       64000
##  2  2016 <NA>     ACAD      ME     3303393
##  3  2015 <NA>     ACAD      ME     2811184
##  4  2014 <NA>     ACAD      ME     2563129
##  5  2013 <NA>     ACAD      ME     2254922
##  6  2012 <NA>     ACAD      ME     2431052
##  7  2011 <NA>     ACAD      ME     2374645
##  8  2010 <NA>     ACAD      ME     2504208
##  9  2009 <NA>     ACAD      ME     2227698
## 10  2008 <NA>     ACAD      ME     2075857
## # ... with 88 more rows
# now we will combine the 2 data.frames into one.
SHEN_ACAD_visits <- bind_rows(SHEN_visits, ACAD_visits)

# check that data from both parks is there
SHEN_ACAD_visits %>%
  pull(unit_code) %>%
  table()
## .
## ACAD SHEN 
##   98   81
rm(SHEN_ACAD_visits, SHEN_visits, ACAD_visits)

You generally would not split up a data.frame and then put it together again like this, but it can be useful if you have data that has the same data structure but starts out broken up by site, date, park etc. and you wish to combine it into one data.frame.
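For instance, if the same measurements arrive as one data.frame per park, you can stack a whole list of them in one call. Here is a base-R sketch using do.call(rbind, ...); dplyr::bind_rows() accepts a list of data.frames the same way. The visitor numbers are copied from the MEDN table shown earlier:

```r
# three small data frames with identical structure, one per park
cabr <- data.frame(unit_code = "CABR", year = 2014:2015, visitors = c(893434, 981825))
chis <- data.frame(unit_code = "CHIS", year = 2014:2015, visitors = c(342161, 324816))
samo <- data.frame(unit_code = "SAMO", year = 2014:2015, visitors = c(694714, 797126))

# stack the whole list at once into a single data.frame
all_parks <- do.call(rbind, list(cabr, chis, samo))
nrow(all_parks)  # 6
```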

  2. bind_cols()

bind_cols() adds the columns from the second data.frame to the right of the first data.frame.

# we'll first make two subsets of the park_visits data
visits_left <- park_visits %>%
  # get first 2 columns.
  select(year, parkname)

visits_left
## # A tibble: 21,174 x 2
##     year parkname              
##    <int> <chr>                 
##  1  1904 Crater Lake           
##  2  1941 Lake Roosevelt        
##  3  1961 Lewis and Clark       
##  4  1935 Olympic               
##  5  1982 Santa Monica Mountains
##  6  1919 <NA>                  
##  7  1969 <NA>                  
##  8  1967 <NA>                  
##  9  1944 <NA>                  
## 10  1989 <NA>                  
## # ... with 21,164 more rows
visits_right <- park_visits %>%
  #  get last 3 columns.
  select(unit_code, state, visitors)

visits_right
## # A tibble: 21,174 x 3
##    unit_code state visitors
##    <chr>     <chr>    <dbl>
##  1 CRLA      OR        1500
##  2 LARO      WA           0
##  3 LEWI      WA       69000
##  4 OLYM      WA        2200
##  5 SAMO      CA      468144
##  6 ACAD      ME       64000
##  7 AMIS      TX      448000
##  8 ASIS      MD      738700
##  9 BIBE      TX        1409
## 10 BICY      FL       81157
## # ... with 21,164 more rows
visits_bound <- bind_cols(visits_left, visits_right)

visits_bound
## # A tibble: 21,174 x 5
##     year parkname               unit_code state visitors
##    <int> <chr>                  <chr>     <chr>    <dbl>
##  1  1904 Crater Lake            CRLA      OR        1500
##  2  1941 Lake Roosevelt         LARO      WA           0
##  3  1961 Lewis and Clark        LEWI      WA       69000
##  4  1935 Olympic                OLYM      WA        2200
##  5  1982 Santa Monica Mountains SAMO      CA      468144
##  6  1919 <NA>                   ACAD      ME       64000
##  7  1969 <NA>                   AMIS      TX      448000
##  8  1967 <NA>                   ASIS      MD      738700
##  9  1944 <NA>                   BIBE      TX        1409
## 10  1989 <NA>                   BICY      FL       81157
## # ... with 21,164 more rows
rm(visits_left, visits_right, visits_bound)

bind_cols() assumes that the rows are in the same order in each data.frame, and that there are no rows in one data.frame that are missing from the other. Unless you are sure both are true, you should avoid using this function.
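To see why position-based binding is risky, here is a toy sketch (2016 visitor totals copied from the outputs above): when the two halves are sorted differently, base R's cbind() — which pairs rows by position just like bind_cols() — quietly mispairs the records, while a key-based join (base R's merge()) does not:

```r
# two halves of the same records, but the second half is sorted differently
left  <- data.frame(park = c("ACAD", "SHEN"), state = c("ME", "VA"))
right <- data.frame(park2 = c("SHEN", "ACAD"), visitors = c(1437341, 3303393))

# cbind() pairs rows purely by position, so ACAD silently
# ends up next to SHEN's visitor count
mispaired <- cbind(left, right)

# a key-based join matches on values instead of position
correct <- merge(left, right, by.x = "park", by.y = "park2")
```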


bind_rows() challenge

Using the park_visits data, make 2 smaller data.frames:

  1. Atlantic will have just data from Puerto Rico (PR) and the Virgin Islands (VI).
  2. Pacific will just have data from American Samoa (AS), Guam (GU) and Hawaii (HI)

Then, using bind_rows() combine the 2 data.frames.


bind_rows Solution
# Atlantic islands
Atlantic <- park_visits %>%
  filter(state %in% c("PR", "VI"))
# Pacific Islands
Pacific <- park_visits %>%
  filter(state %in% c("AS", "GU", "HI"))

# All islands
Islands <- bind_rows(Atlantic, Pacific)


rm(Atlantic, Pacific, Islands)


Basic Joins - left_join() and right_join()

left_join() and right_join() are a safer way to join columns from one data.frame to the columns in another.

For either of these functions, you need to specify one or more key columns using the by argument. Key columns determine which rows in one data.frame match which rows in the other. The order of the rows in the data.frames does not matter.

The syntax is left_join(x, y, by = "name"), where x and y are data.frames.

The only difference between left_join() and right_join() is what happens when the rows in the data.frames don’t match up one to one. left_join() keeps all the rows in x and drops any rows in y that don’t match, while right_join() keeps all the rows in y and drops any rows in x that don’t match.

Let’s combine the gas_price data with the state_pop data. The gas price data does not vary by state, so we just have to match the data by year. In order to make the way these functions work a little clearer, we will first filter the state_pop data to only include data before 2000, so that there is less of an overlap between the 2 data sources.

# filter the state data
state_pop2 <- state_pop %>% filter(year < 2000)

# left_join example
state_pop_gas1 <- left_join(
  x = state_pop2, # first data.frame
  y = gas_price, # second data.frame
  by = "year" # key column
)

state_pop_gas1
## # A tibble: 5,100 x 5
##     year state     pop gas_current gas_constant
##    <dbl> <chr>   <dbl>       <dbl>        <dbl>
##  1  1900 AL    1830000          NA           NA
##  2  1901 AL    1907000          NA           NA
##  3  1902 AL    1935000          NA           NA
##  4  1903 AL    1957000          NA           NA
##  5  1904 AL    1978000          NA           NA
##  6  1905 AL    2012000          NA           NA
##  7  1906 AL    2045000          NA           NA
##  8  1907 AL    2058000          NA           NA
##  9  1908 AL    2070000          NA           NA
## 10  1909 AL    2108000          NA           NA
## # ... with 5,090 more rows

A couple of things to notice. The state_pop_gas1 data.frame has 5100 rows because that is what is in the state_pop2 data.frame. Every year is repeated 51 times (once for each state + DC) so the gas price data for a given year gets repeated 51 times as well. The gas_price data only goes back to 1929, so for years before that, the data just has NA for gas prices.

Here is the exact same thing, but this time using right_join()

# right_join example
state_pop_gas2 <- right_join(
  x = state_pop2, # first data.frame
  y = gas_price, # second data.frame
  by = "year" # key column
)

state_pop_gas2
## # A tibble: 3,637 x 5
##     year state     pop gas_current gas_constant
##    <dbl> <chr>   <dbl>       <dbl>        <dbl>
##  1  1929 AL    2644000        0.21         2.38
##  2  1930 AL    2647000        0.2          2.3 
##  3  1931 AL    2649000        0.17         2.18
##  4  1932 AL    2653000        0.18         2.61
##  5  1933 AL    2661000        0.18         2.66
##  6  1934 AL    2685000        0.19         2.67
##  7  1935 AL    2719000        0.19         2.62
##  8  1936 AL    2743000        0.19         2.67
##  9  1937 AL    2762000        0.2          2.63
## 10  1938 AL    2787000        0.2          2.64
## # ... with 3,627 more rows

Now the data is very different. It has 3637 rows, which seems pretty random. What has happened is:

  1. All data pre-1929 was dropped.
  2. Data for the 71 years from 1929 to 1999 has been repeated 51 times to match the 51 states (=3621 records).
  3. The years 2000 to 2015 (16 records) each appear once as there is no matching data in the state_pop2 data.frame.

So, all the data from gas_price is present; where necessary, it was repeated to match up with the state_pop2 data.
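The same repeat-and-drop logic is easy to see on a toy pair of tables (made-up numbers), using base R's merge() with all.x = TRUE as the analogue of left_join():

```r
# toy tables: population varies by state and year, gas price by year only
pop <- data.frame(
  year  = c(1998, 1998, 1999, 1999),
  state = c("AL", "AK", "AL", "AK"),
  pop   = c(4.3, 0.6, 4.4, 0.6)  # millions (made-up values)
)
gas <- data.frame(year = c(1999, 2000), price = c(1.1, 1.5))

# merge(all.x = TRUE) behaves like left_join(pop, gas, by = "year"):
# the 1999 gas price is repeated for each matching state row,
# 1998 rows get NA (no gas data), and 2000 is dropped (not in pop)
pop_gas <- merge(pop, gas, by = "year", all.x = TRUE)
```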


Basic Joins - inner_join() and full_join()

In the previous section we used left and right joins to get all of the data from one data.frame with matching data from a second data.frame. Now we will look at 2 other joins:

  1. inner_join() - only rows that have matches in both data.frames are kept
  2. full_join() - you get all the rows from both datasets even if some don’t match.

inner join:

# inner_join example 

state_pop_gas3 <- inner_join(
  x = state_pop2, # first data.frame
  y = gas_price, # second data.frame
  by = "year" # key column
)

state_pop_gas3
## # A tibble: 3,621 x 5
##     year state     pop gas_current gas_constant
##    <dbl> <chr>   <dbl>       <dbl>        <dbl>
##  1  1929 AL    2644000        0.21         2.38
##  2  1930 AL    2647000        0.2          2.3 
##  3  1931 AL    2649000        0.17         2.18
##  4  1932 AL    2653000        0.18         2.61
##  5  1933 AL    2661000        0.18         2.66
##  6  1934 AL    2685000        0.19         2.67
##  7  1935 AL    2719000        0.19         2.62
##  8  1936 AL    2743000        0.19         2.67
##  9  1937 AL    2762000        0.2          2.63
## 10  1938 AL    2787000        0.2          2.64
## # ... with 3,611 more rows


Now the data.frame has 3621 rows: 71 years of data from 51 states. This is the smallest result of these joins. The years in state_pop_gas3 are only those found in both data.frames.

full join:

# full_join example 

state_pop_gas4 <- full_join(
  x = state_pop2, # first data.frame
  y = gas_price, # second data.frame
  by = "year" # key column
)

state_pop_gas4
## # A tibble: 5,116 x 5
##     year state     pop gas_current gas_constant
##    <dbl> <chr>   <dbl>       <dbl>        <dbl>
##  1  1900 AL    1830000          NA           NA
##  2  1901 AL    1907000          NA           NA
##  3  1902 AL    1935000          NA           NA
##  4  1903 AL    1957000          NA           NA
##  5  1904 AL    1978000          NA           NA
##  6  1905 AL    2012000          NA           NA
##  7  1906 AL    2045000          NA           NA
##  8  1907 AL    2058000          NA           NA
##  9  1908 AL    2070000          NA           NA
## 10  1909 AL    2108000          NA           NA
## # ... with 5,106 more rows
rm(state_pop_gas1, state_pop_gas2, state_pop_gas3, state_pop_gas4)

This data.frame has 5116 rows, the most of any of these joins. It includes all rows from either data.frame, regardless of whether there is a match.


Basic Joins Challenge

Using the park_visits and the state_pop data create a new data.frame with all the data from park_visits and any matching data from state_pop.

Hint if you need to use 2 columns as keys the you write the by= argument like so: by=c("Column1","Column2").


Basic Join Solution
park_visits_pop <- left_join(park_visits, state_pop, by = c("state", "year"))

rm(park_visits_pop)


Filtering Joins

Sometimes you don’t want to combine two tables, but you do want to filter one table so that it only has data that can match data in another table. semi_join() is used for this.

semi join:

# semi_join example 

state_pop_gas5 <- semi_join(
  x = state_pop2, # first data.frame
  y = gas_price, # second data.frame
  by = "year" # key column
)

state_pop_gas5
## # A tibble: 3,621 x 3
##     year state     pop
##    <dbl> <chr>   <dbl>
##  1  1929 AL    2644000
##  2  1930 AL    2647000
##  3  1931 AL    2649000
##  4  1932 AL    2653000
##  5  1933 AL    2661000
##  6  1934 AL    2685000
##  7  1935 AL    2719000
##  8  1936 AL    2743000
##  9  1937 AL    2762000
## 10  1938 AL    2787000
## # ... with 3,611 more rows

The columns in state_pop_gas5 are the same as those in state_pop2. However, all the data from years prior to 1929 has been removed, because pre-1929 data is not in the gas_price data.

anti_join() is the opposite of semi_join(). It keeps the data in one data.frame that does not have a match in the second data.frame.

anti_join:

# anti_join example 

state_pop_gas6 <- anti_join(
  x = state_pop2, # first data.frame
  y = gas_price, # second data.frame
  by = "year" # key column
)

state_pop_gas6
## # A tibble: 1,479 x 3
##     year state     pop
##    <dbl> <chr>   <dbl>
##  1  1900 AL    1830000
##  2  1901 AL    1907000
##  3  1902 AL    1935000
##  4  1903 AL    1957000
##  5  1904 AL    1978000
##  6  1905 AL    2012000
##  7  1906 AL    2045000
##  8  1907 AL    2058000
##  9  1908 AL    2070000
## 10  1909 AL    2108000
## # ... with 1,469 more rows
# clean up
rm(state_pop2, state_pop_gas5, state_pop_gas6)

state_pop_gas6 only has 1479 rows, because it only contains data from before 1929, when the gas_price data starts.


Filtering Joins Challenge

Filter the park_visits data so that it only contains rows that have matching data in both the state_pop data and the gas_price data.

Note that the state_pop data has some NAs for population for Alaska and Hawaii prior to 1949. For a bigger challenge, don’t include those state/year combinations in your final data set.

Hint: the function is.na(data) will be TRUE when a value is NA. If you put an exclamation mark in front !is.na(data) it will tell you when the values are not NA.
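A quick illustration of the hint, using population values borrowed from the AL rows above with one made NA:

```r
pop <- c(1830000, NA, 1935000)

is.na(pop)   # FALSE  TRUE FALSE
!is.na(pop)  #  TRUE FALSE  TRUE

# keep only the non-missing values
pop[!is.na(pop)]
```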


Filtering Joins Solution
## In this case we filter out the NA values inside the semi_join() call. You could also filter them out first, save the result to a new data.frame, and then do the join.

park_visits_filtered<-semi_join(park_visits, state_pop %>% filter(!is.na(pop)), by=c("state", "year")) %>% 
  semi_join(gas_price, by="year")

park_visits_filtered
## # A tibble: 19,964 x 5
##     year parkname               unit_code state visitors
##    <int> <chr>                  <chr>     <chr>    <dbl>
##  1  1941 Lake Roosevelt         LARO      WA           0
##  2  1961 Lewis and Clark        LEWI      WA       69000
##  3  1935 Olympic                OLYM      WA        2200
##  4  1982 Santa Monica Mountains SAMO      CA      468144
##  5  1969 <NA>                   AMIS      TX      448000
##  6  1967 <NA>                   ASIS      MD      738700
##  7  1944 <NA>                   BIBE      TX        1409
##  8  1989 <NA>                   BICY      FL       81157
##  9  1988 <NA>                   BISO      TN      613011
## 10  1941 <NA>                   BLRI      NC      895874
## # ... with 19,954 more rows
rm(park_visits_filtered)


Day 3: Data Viz

Intro to Data Visualizations

Goals of this Day
  1. Learn rules for effective data visualizations
  2. Create simple plots, using base R functions
  3. Create customized and interactive plots, using ggplot functions
  4. Arrange multiple plots on a page, using cowplot and patchwork functions
  5. Create interactive maps, using sf and tmap functions

The Power of (Good) Data Visualizations

Data are useful only when used.
Data are used only when understood.
Consider three examples…

Example 1

Most people can understand this…

Average daily Covid cases per 100k people, by region (Sources: State and local health agencies [cases]; Census Bureau [population data])

faster than they can understand this…
Daily Covid cases and population numbers by state (only showing first 12 records)
state timestamp cases total_population
AK 2022-01-25 04:00:00 203110 731545
AL 2022-01-25 04:00:00 1153149 4903185
AR 2022-01-25 04:00:00 738652 3017804
AZ 2022-01-25 04:00:00 1767303 7278717
CA 2022-01-25 04:00:00 7862003 39512223
CO 2022-01-25 04:00:00 1207991 5758736
CT 2022-01-25 04:00:00 683731 3565287
DC 2022-01-25 04:00:00 128550 705749
DE 2022-01-25 04:00:00 241367 973764
FL 2022-01-25 04:00:00 5324438 21477737
GA 2022-01-25 04:00:00 2202292 10617423
HI 2022-01-25 04:00:00 204933 1415872

Example 2

This table shows average monthly revenue for Acme products.

Average monthly revenue (in $1000’s) from Acme product sales, 1950 - 2020
category product Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
party supplies balloons 892 1557 1320 972 1309 1174 1153 1138 1275 1178 1325 1422
party supplies confetti 1271 1311 829 1020 1233 1061 1088 1395 1376 1152 1568 1412
party supplies party hats 1338 1497 1445 956 1372 1482 1048 877 1404 1030 1458 1547
party supplies wrapping paper 1396 1026 932 891 1364 896 900 1221 1146 967 1394 1507
school supplies backpacks 1802 1773 1611 1723 1799 1730 1813 1676 1748 1652 1819 1759
school supplies notebooks 1153 1471 1541 1371 1592 1514 1725 1702 1457 1604 1729 1279
school supplies pencils 1679 1304 1054 1259 1425 1608 1972 1811 1610 1004 1417 1283
school supplies staplers 1074 1708 1439 1154 1551 1099 1793 1601 1647 1666 1389 1511

Use the table above to answer these questions:

  1. What product and month had the highest average monthly revenue?
  2. What product and month had the lowest average monthly revenue?
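Answering these from the raw table takes careful scanning; R can answer them directly. Here is a sketch with the table's values entered as a matrix:

```r
months   <- c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec")
products <- c("balloons","confetti","party hats","wrapping paper",
              "backpacks","notebooks","pencils","staplers")

revenue <- matrix(c(
   892, 1557, 1320,  972, 1309, 1174, 1153, 1138, 1275, 1178, 1325, 1422,
  1271, 1311,  829, 1020, 1233, 1061, 1088, 1395, 1376, 1152, 1568, 1412,
  1338, 1497, 1445,  956, 1372, 1482, 1048,  877, 1404, 1030, 1458, 1547,
  1396, 1026,  932,  891, 1364,  896,  900, 1221, 1146,  967, 1394, 1507,
  1802, 1773, 1611, 1723, 1799, 1730, 1813, 1676, 1748, 1652, 1819, 1759,
  1153, 1471, 1541, 1371, 1592, 1514, 1725, 1702, 1457, 1604, 1729, 1279,
  1679, 1304, 1054, 1259, 1425, 1608, 1972, 1811, 1610, 1004, 1417, 1283,
  1074, 1708, 1439, 1154, 1551, 1099, 1793, 1601, 1647, 1666, 1389, 1511
), nrow = 8, byrow = TRUE, dimnames = list(products, months))

# arrayInd() converts the position of the max/min into a (row, column) pair
hi <- arrayInd(which.max(revenue), dim(revenue))
lo <- arrayInd(which.min(revenue), dim(revenue))

c(products[hi[1]], months[hi[2]])  # "pencils"  "Jul"
c(products[lo[1]], months[lo[2]])  # "confetti" "Mar"
```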

Table as Heat Map

Let’s display the table as a heat map, with larger numbers represented by darker color cells. Now how quickly can we answer those same two questions? What patterns can we see in the heat map that were not obvious in the table above?


Example 3

In 1973, Francis Anscombe published “Graphs in statistical analysis”, a paper describing four bivariate datasets with identical means, variances, and correlations.

Anscombe’s Quartet - Four bivariate datasets with identical summary statistics
x1 y1 x2 y2 x3 y3 x4 y4
10 8.04 10 9.14 10 7.46 8 6.58
8 6.95 8 8.14 8 6.77 8 5.76
13 7.58 13 8.74 13 12.74 8 7.71
9 8.81 9 8.77 9 7.11 8 8.84
11 8.33 11 9.26 11 7.81 8 8.47
14 9.96 14 8.10 14 8.84 8 7.04
6 7.24 6 6.13 6 6.08 8 5.25
4 4.26 4 3.10 4 5.39 19 12.50
12 10.84 12 9.13 12 8.15 8 5.56
7 4.82 7 7.26 7 6.42 8 7.91
5 5.68 5 4.74 5 5.73 8 6.89
Means and variances are identical in the four datasets. The correlation between x and y (r = 0.82) is also identical across the datasets.
x1 y1 x2 y2 x3 y3 x4 y4
mean 9 7.50 9 7.50 9 7.50 9 7.50
var 11 4.13 11 4.13 11 4.12 11 4.12

Anscombe data as plots

Despite their identical statistics, when we actually plot the data we see the four datasets are very different.


Anscombe’s point? TO UNDERSTAND THE DATA, WE MUST PLOT THE DATA!
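You can reproduce Anscombe's demonstration yourself: base R ships the quartet as the built-in `anscombe` data frame. A minimal base-R sketch of the four-panel plot:

```r
# plot each of the four datasets with its fitted regression line
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  plot(x, y, pch = 19, main = paste("Dataset", i))
  abline(lm(y ~ x))  # each fitted line is essentially y = 3 + 0.5x
}
par(op)
```

Despite the wildly different scatter in each panel, all four regression lines are nearly identical, which is exactly Anscombe's point.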

Rules for Effective Data Visualizations

Anscombe used clever thinking and simple plots to demonstrate the importance of data visualizations. But it’s not enough to just plot the data. To have impact, a plot must convey a clear message. How can we do this?

  • Have a purpose. Every data visualization should tell a story. What story does the Covid plot tell? What message does Anscombe’s quartet of plots convey?

  • Consider your audience. Avoid scientific names, acronyms, or jargon unless your audience is well-versed in that language. Use color-blind friendly colors.

  • Use an appropriate visualization. For example:

    • Line graphs work well for showing changes in continuous data over time
    • Bar charts compare counts or proportions in categorical data (pie charts get a bad rap in the data viz world, but can be useful in certain situations)
    • For statistics (e.g., means) with confidence intervals, point plots with error bars are preferred over bar charts (nice explanation here)
    • Scatterplots are useful for showing the relationship (correlation) between two continuous variables.
    • Matrix heat maps can efficiently compare the magnitude of numbers when we have lots of data structured in table format, especially when colors have a clear connection to the numbers (e.g., scorecard data)
    • Box plots, violin plots, and histograms show distributions and outliers for continuous data. Dot plots are a useful alternative when sample sizes are small.

  • KEEP IT SIMPLE!!!

    • Every plot element and aesthetic should have a purpose.
    • Avoid 3D charts unless you have a good reason to use them.
    • Don’t try to cram everything into one plot (e.g., juxtapose two plots instead of adding a secondary y-axis. Nice explanation here).

  • Use informative text and arrows wisely. Clear and simple titles, subtitles, axes titles/labels, and annotations help convey the message of a plot. Use lines and arrows (sparingly but effectively) to emphasize important thresholds, data points or other plot features. Fonts should be large enough with good contrast (against the background) and sufficient white space to be easily readable.


Let’s Start Coding

For today’s data visualizations we will use the park visits data you’re already familiar with from the Data Wrangling module.

From subsets of the park visits and gas price data, we will create plots to:

  • compare the distribution of park visits among years and unit types
  • show temporal trends and unusual years in park visit patterns
  • show temporal trends in ranking of the top 5 visited park units
  • show the relationship between gas price and park visits


First steps:

  1. Load packages for today’s module
  2. Import and view the park visits and gas price data
  3. Extract a subset of the park visits data to work with
  4. Merge in the gas price data
  5. Check and clean/tidy the data
Step 1. Load packages
# This is a useful chunk of code that installs any missing packages and then loads all of them. Useful if we're sharing code with others who may have older versions of RStudio installed (more recent versions have pop-up prompts that ask if we want to install missing packages)
pkgs <- c("tidyverse", # includes dplyr, ggplot2, and other really useful packages
          "magrittr", # for the `%<>%` function to reassign piping outcomes to the original variable
          "viridis", # popular colorblind-friendly palette
          "scales", # show_col() and scale_x_date()
          "knitr") # format tables for display 
installed_pkgs <- pkgs %in% rownames(installed.packages()) # logical vector flagging which packages are already installed
if (any(!installed_pkgs)) install.packages(pkgs[!installed_pkgs], dep = TRUE) # if some packages are not yet installed, go ahead and install them...
lapply(pkgs, library, character.only = TRUE) # ...then load all the packages and their dependencies
Step 2. Import and view the data
park_visits <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/national_parks.csv") # visits to US National Parks

gas_price <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/gas_price.csv") # gas prices over space and time

View() the data to get an initial sense of what we’re working with.

# Examine the data to understand data structure, data types, and potential problems
View(park_visits); summary(park_visits)
View(gas_price); summary(gas_price)
Step 3. Extract a subset of data

Before we run more detailed data checks, let’s extract a subset of the park_visits data to work with.

CHALLENGE: Extract a subset of the park_visits data

Create a new data frame that contains a subset of the park_visits data. Call the new data frame park_sub. We only want to include park_visits data that meet these criteria:

  • region is IM, PW or SE
  • unit_type is National Monument, National Historic Site, or National Park
  • year is in the range 2000 - 2015, inclusive


We also only need to keep a subset of the data columns from park_visits. The new data frame should have these columns (in this order): region, unit_type, unit_name, unit_code, year, visitors


Quick Tip!
Even though park_visits$year is classified as a character data type, R can sort these numbers-as-characters in numeric order. Try running sort(unique(park_visits$year)) in the console to see for yourself. This means we can filter park_visits for a subset of years "2000":"2015" without first converting park_visits$year to a numeric data type (actually we can even omit the quotes and simply filter for years in 2000:2015).

The new data frame should look like this when we type glimpse(park_sub) in the console:

## Rows: 1,919
## Columns: 6
## $ region    <chr> "PW", "SE", "IM", "IM", "PW", "IM", "PW", "PW", "PW", "IM", ~
## $ unit_type <chr> "National Historic Site", "National Historic Site", "Nationa~
## $ unit_name <chr> "Manzanar National Historic Site", "Tuskegee Airmen National~
## $ unit_code <chr> "MANZ", "TUAI", "SAND", "WABA", "CECH", "WACO", "NPSA", "CRL~
## $ year      <chr> "2000", "2003", "2010", "2002", "2013", "2015", "2002", "201~
## $ visitors  <dbl> 38010, 11085, 4063, 15374, 8157, 20551, 1938, 614712, 326376~
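One possible solution (a sketch, not the only approach; it assumes `park_visits` was imported in Step 2 with the tidyverse loaded):

```r
# One possible solution -- assumes `park_visits` was imported in Step 2
park_sub <- park_visits %>%
  dplyr::filter(region %in% c("IM", "PW", "SE"),
                unit_type %in% c("National Monument", "National Historic Site",
                                 "National Park"),
                year %in% 2000:2015) %>% # `%in%` matches the character years (see Quick Tip)
  dplyr::select(region, unit_type, unit_name, unit_code, year, visitors)
```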
Step 4. Merge the gas_constant data with the park_sub data frame

CHALLENGE: Join the gas_constant data (from gas_price data frame) to park_sub, matching records by year. Call the new data frame joined_dat.

The new data frame should show the additional data column gas_constant when we type glimpse(joined_dat) in the console:

## Rows: 1,919
## Columns: 7
## $ region       <chr> "PW", "SE", "IM", "IM", "PW", "IM", "PW", "PW", "PW", "IM~
## $ unit_type    <chr> "National Historic Site", "National Historic Site", "Nati~
## $ unit_name    <chr> "Manzanar National Historic Site", "Tuskegee Airmen Natio~
## $ unit_code    <chr> "MANZ", "TUAI", "SAND", "WABA", "CECH", "WACO", "NPSA", "~
## $ year         <dbl> 2000, 2003, 2010, 2002, 2013, 2015, 2002, 2015, 2015, 201~
## $ visitors     <dbl> 38010, 11085, 4063, 15374, 8157, 20551, 1938, 614712, 326~
## $ gas_constant <dbl> 2.02, 2.01, 3.02, 1.75, 3.62, 2.45, 1.75, 2.45, 2.45, 2.4~

We now have all the data we need in a single data frame.
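One way to produce joined_dat (a sketch; it assumes gas_price has columns year and gas_constant, as the glimpse() output above suggests):

```r
# One possible solution -- assumes `park_sub` from Step 3 and the `gas_price` data frame
joined_dat <- park_sub %>%
  dplyr::mutate(year = as.numeric(year)) %>% # match the numeric `year` in `gas_price`
  dplyr::left_join(gas_price %>% dplyr::select(year, gas_constant), by = "year")
```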

Step 5. Check and clean/tidy the data

We are primarily using dplyr to wrangle data and ggplot2 to plot data. These two packages are part of the Tidyverse of packages for data science. Tidyverse packages share a common design philosophy centered on ideas like breaking down complex tasks into a series of simpler functions executed in sequence (e.g., piping %>%). Functions from tidyverse packages work together most seamlessly and efficiently when data are in a tidy format. Remember that a ‘tidy’ format means data have one variable per column and one observation per row.

POLL: Which of these is ‘tidy’ data?

  1. Plants that Grow
  2. Political Twitter Likes
  3. Test Grades and Lunch
  4. A & C
  5. None of the above
  1. Plants That Grow
plant_id age_days height_cm num_leaves
1 160 21 11
2 160 19 12
3 160 32 16
4 160 23 8
1 190 24 12
2 190 20 14
  2. Political Twitter Likes
political_party Cat Food Breath Doug the Pug Jon Pigeon
Democrat 5201 4760 1200
Republican 6912 3448 800
Independent 3722 2666 5128
  3. Test Grades and Lunch
student before_lunch after_lunch
Robin 89 71
Marl 77 76
Randi 82 43

What about our joined_dat data frame? Is it in ‘tidy’ format? Yes, it is! No need to pivot_longer() anything today. But let’s make two final changes to the data before we begin plotting. First, we will convert region to an ordered factor with more descriptive labels. Second, we will remove park units that don’t have visitor data for every year from 2000 to 2015. We will end with a final check of the data to make sure it’s ready for plotting.
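As an aside, here is how an untidy table like ‘Test Grades and Lunch’ could be tidied with tidyr’s pivot_longer() (a sketch, with the values typed in by hand):

```r
library(tidyr)

# The untidy table: one row per student, one column per observation time
grades <- data.frame(student = c("Robin", "Marl", "Randi"),
                     before_lunch = c(89, 77, 82),
                     after_lunch  = c(71, 76, 43))

# Pivot to tidy format: one row per student x timing observation
grades_tidy <- pivot_longer(grades,
                            cols = c(before_lunch, after_lunch),
                            names_to = "timing",
                            values_to = "grade")
```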

Convert region to a factor data type

Convert the variable region to a factor data type and rename and re-order the levels as Pacific West < Intermountain < and Southeast (the default order is alphabetical). This way, bar plots and color legends will order the region levels from west to east instead of alphabetically.

# NOTES:
# 1. The factor data type is similar to (and often interchangeable with) the character data type, but limits entries to pre-specified values and allows us to specify an order to those values (other than the default alphabetical order)
# 2. Factors are especially useful for ordered categories (e.g., low < medium < high) with a small number of possible values
# 3. Under the hood, `R` stores factors as integer vectors that index the ordered factor levels (with character string labels)
# 4. I usually classify categorical variables as character data types, unless I want to limit entries to a few pre-specified values, or if I want to enforce a particular order to the category levels

unique(joined_dat$region) # check what values are in `region`
## [1] "PW" "SE" "IM"
joined_dat$region <- 
  factor(joined_dat$region, 
         levels = c("PW", "IM", "SE"), # specifies the factor levels and an order for the levels. If we omit this argument, the factor levels will be the unique set of values in the column, ordered alphabetically
         labels = c("Pacific West", "Intermountain", "Southeast")) # relabels the levels with more descriptive names

# Check the data
glimpse(joined_dat) # `region` is now a factor data type
## Rows: 1,919
## Columns: 7
## $ region       <fct> Pacific West, Southeast, Intermountain, Intermountain, Pa~
## $ unit_type    <chr> "National Historic Site", "National Historic Site", "Nati~
## $ unit_name    <chr> "Manzanar National Historic Site", "Tuskegee Airmen Natio~
## $ unit_code    <chr> "MANZ", "TUAI", "SAND", "WABA", "CECH", "WACO", "NPSA", "~
## $ year         <dbl> 2000, 2003, 2010, 2002, 2013, 2015, 2002, 2015, 2015, 201~
## $ visitors     <dbl> 38010, 11085, 4063, 15374, 8157, 20551, 1938, 614712, 326~
## $ gas_constant <dbl> 2.02, 2.01, 3.02, 1.75, 3.62, 2.45, 1.75, 2.45, 2.45, 2.4~
levels(joined_dat$region) # shows the (renamed) factor levels and their order
## [1] "Pacific West"  "Intermountain" "Southeast"
str(joined_dat$region) # shows the values are coded as integers indicating factor level order
##  Factor w/ 3 levels "Pacific West",..: 1 3 2 2 1 2 1 1 1 2 ...
summary(joined_dat) # the summary for `region` now counts the number of times each factor level occurs
##            region     unit_type          unit_name          unit_code        
##  Pacific West :527   Length:1919        Length:1919        Length:1919       
##  Intermountain:965   Class :character   Class :character   Class :character  
##  Southeast    :427   Mode  :character   Mode  :character   Mode  :character  
##                                                                              
##                                                                              
##                                                                              
##       year         visitors         gas_constant  
##  Min.   :2000   Min.   :       0   Min.   :1.750  
##  1st Qu.:2004   1st Qu.:   57344   1st Qu.:2.320  
##  Median :2008   Median :  187397   Median :3.000  
##  Mean   :2008   Mean   :  602483   Mean   :2.828  
##  3rd Qu.:2012   3rd Qu.:  652166   3rd Qu.:3.610  
##  Max.   :2015   Max.   :10712674   Max.   :3.800
Remove park units with missing years of data

We will be creating plots to see if trends in park visits differ by region or unit type, so we want to exclude park units for which we don’t have complete data for the years 2000 to 2015.

Finding which park units have missing years of data is not as simple as filtering for NA’s in joined_dat because missing park unit-years are not represented by records with NA for the visitor count–the missing data are simply not in the data frame.

We could use tidyr’s expand_grid function to build a template of every park unit-year combination, then join that template with joined_dat to create the missing records with an NA entry for visitors. We would then delete any park unit that has at least one NA in the visitors column. expand_grid is a useful function to know because we sometimes need to add missing records back in (filling in zeros, for example, to indicate we visited a site but found zero saplings) so that summary statistics such as means and standard deviations are calculated correctly (a true zero count should not be treated as missing data).
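A minimal base-R illustration of the idea, using hypothetical data (base R's expand.grid plays the role of tidyr's expand_grid here):

```r
# Hypothetical visit data with one missing unit-year (unit B in 2001)
visits <- data.frame(unit = c("A", "A", "B"),
                     year = c(2000, 2001, 2000),
                     visitors = c(5, 7, 3))

# Template of every unit-year combination (tidyr::expand_grid works similarly)
template <- expand.grid(unit = c("A", "B"), year = 2000:2001,
                        stringsAsFactors = FALSE)

# Join the template to the data; missing unit-years show up as NA
full <- merge(template, visits, all.x = TRUE)
full[is.na(full$visitors), ] # reveals the missing record: unit B, year 2001
```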

In our case, we have a simpler option. We can count the unique years of data for each park unit and delete park units that have less than 16 unique years of data.

(Note that we should also check for park unit-year combinations with multiple records. If these are duplicate records, the duplicates can be deleted. If these records have different visitor counts for the same park unit-year, we would have to find out which record is the “correct” one).

length(2000:2015) # if we can't figure this out in our head, we probably need more sleep. But here is proof that there are 16 years in the range 2000 to 2015, inclusive
## [1] 16
# Check if we have multiple records for any combination of unit_code and year (e.g., ARCH should not have multiple records for the year 2016)

# Short example to show how `duplicated` works
test <- data.frame(x = c(1, 1, 3, 2, 2), y = c(7, 2, 4, 1, 1), z = 1:5)
duplicated(test[c("x", "y")])
## [1] FALSE FALSE FALSE FALSE  TRUE
test[duplicated(test[c("x", "y")]), ]
##   x y z
## 5 2 1 5
# Now use `duplicated` on our data
joined_dat[duplicated(joined_dat[c("unit_code", "year")]), ] # good, no duplicates
## # A tibble: 0 x 7
## # ... with 7 variables: region <fct>, unit_type <chr>, unit_name <chr>,
## #   unit_code <chr>, year <dbl>, visitors <dbl>, gas_constant <dbl>
# Identify park units with less than 16 years of data
remove_units <- joined_dat %>%
  count(unit_code) %>% # count() combines group_by() and summarize(n = n()) in a single function. 
  dplyr::filter(n != 16) # keep only the park units that do NOT have exactly 16 years of data

# View(remove_units) to check which park units will be removed from the data

# Remove those park units from the data
viz_dat <- joined_dat %>%
  dplyr::filter(!unit_code %in% remove_units$unit_code)
  
# View(viz_dat) to make sure the park units have been removed. The final data frame should have 1856 rows and 7 columns.
# If you want to save `viz_dat` to your current working directory, run the code below. This code saves `viz_dat` as an .RDS file, which just means it saves it as an `R` object instead of as a .csv file. 

# When we save a data frame as an `R` object we retain the accompanying attribute information (e.g., the order of factor levels for `region`)--this information would not be saved in a .csv file.

saveRDS(viz_dat, "viz_dat.RDS")

# To read the RDS file from your working directory and assign it to `viz_dat`, use the code below. You can assign it to any name you want. If you're reading it from a location other than your current working directory, provide the file path in the function argument.
viz_dat <- readRDS("viz_dat.RDS")

Note that a report of our findings should explain how and why we selected this subset of data to work with.

Final data check

Always do a final check of the data to look for missing data, duplicated records, funny numbers or category levels, etc. It helps to create summary tables that decompose the data in various ways. These tables can identify oddities (e.g., category levels with only one record) and also give us an idea of the sample sizes we’ll be working with if we create plots with subsets of the data.

CHALLENGE: Create a table (data frame) that shows the number of park units in each combination of region and unit_type. Extra credit if you can reformat the table to wide instead of long format.

The new data frame (if you have it in wide format) should look like this:

## # A tibble: 3 x 4
##   unit_type              `Pacific West` Intermountain Southeast
##   <chr>                           <int>         <int>     <int>
## 1 National Historic Site              7             9        12
## 2 National Monument                  10            35         8
## 3 National Park                      17            18         7


We see from this table that we have data for 35 National Monuments in the Intermountain region and that the smallest number of park units in any region-type is 7. We’re ready to start plotting.
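One possible solution for the challenge above (a sketch; it assumes the viz_dat data frame from the earlier steps):

```r
# Count park units per region x unit_type combination, then pivot to wide format
unit_counts <- viz_dat %>%
  dplyr::distinct(region, unit_type, unit_code) %>% # one row per park unit
  dplyr::count(unit_type, region) %>%
  tidyr::pivot_wider(names_from = region, values_from = n) # extra credit: wide format
```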

Plotting with Base R

# This tab requires the `viz_dat` data frame created earlier in the module. If you don't have this data frame in the global environment, run the code below.

viz_dat <- readRDS("viz_dat.RDS") # this file is provided with the training materials. If this file is not in your current working directory, provide the file path in the function argument

Plots can reveal data issues that we had overlooked with our data checks. I sometimes use base R plotting functions to create quick-and-dirty plots for this purpose. If I want anything more than a very basic plot, I find ggplot functions easier to remember and use.

In addition to quick plots, base R plotting functions have another important use. Many R packages have functions that create specific data visualizations when R’s generic plot() is applied to an object of a particular class. For example, lm() is a statistical function that fits a general linear model to data and saves the model as an object of class lm. Run the code below to see what happens when we apply plot() to an object of class lm.

# NOTE: `lm` is part of the `stats` package that is automatically installed with `R`. This code comes from the Examples section of `?lm`.

# Create fake data
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl, trt)

# Simple linear regression of `weight` as a function of group
mod <- lm(weight ~ group) # assign model results to the variable `mod`
summary(mod) # model summary
class(mod) # class of `mod` is `lm`
plot(mod) # residual plots

We aren’t going to discuss statistics or residual plots today, but it’s useful to know these common uses for plot().


Feeding Data to Base R Plotting Functions

Look in the help file for the R graphics package that is automatically installed with R (in RStudio’s ‘Packages’ tab, click the hyperlink for the package called ‘graphics’). Here you will find many functions for base R graphics. For example, the boxplot() function draws box plots.

Typically, we can feed data to most base R plotting functions in one of two ways:

  1. Provide the data as a formula of the form y ~ x where y is a vector of data for the y-axis and x is a vector of data for the x-axis. For example, plot(y ~ x, data = my_data) or plot(my_data$y ~ my_data$x).

  2. Provide the vectors as separate arguments. For example, plot(x = my_data$x, y = my_data$y). When providing vectors as separate arguments, each vector must be a standalone vector. That is, we can’t just give column names and provide the data frame via a data argument.

With the non-formula approach, the argument names may not be x and y. For example, with bar plots, the argument name height is used instead of y. Refer to the help (e.g., ?barplot) for each plotting function to see what arguments it accepts.

I tend to use the formula approach for feeding data to base R plotting functions, because it is more consistent across the different types of plots. In other words, I’m too lazy to look up what argument names to use with each plot type for the non-formula approach.
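A quick illustration of the two approaches with a made-up data frame (both calls draw the same scatterplot):

```r
my_data <- data.frame(x = 1:5, y = c(2, 4, 3, 5, 7))
plot(y ~ x, data = my_data)        # 1. formula interface
plot(x = my_data$x, y = my_data$y) # 2. separate standalone vectors
```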


Example: Box Plots

Let’s use base R functions to create a box plot of park visitors by year, for a subset of viz_dat that meets these criteria:

  • region is Intermountain
  • unit_type is National Monument
  • year is 2000, 2005, 2010, or 2015 (each year will be a separate box plot)

We want to know how much spread there is (across park units) in the number of visitors to National Monuments of the Intermountain region. We also want to know if visitor counts differed much among the four years.

# Check out `?boxplot`

# Subset the data
subdat <- viz_dat %>%
  dplyr::filter(
    year %in% c(2000, 2005, 2010, 2015),
    unit_type == "National Monument",
    region == "Intermountain")

boxplot(visitors ~ year, data = subdat) # the basic boxplot, with data provided as a formula

# An alternative way to feed the data as a formula
# boxplot(subdat$visitors ~ subdat$year) # no need for `data` argument

This plot is okay, but even for a basic plot I would want to make the y-axis labels more understandable. Because the visitor numbers are large, R is using scientific notation for the y-axis labels. So instead of 200,000 we see 2e+05. Let’s change that in the box plot and add plot and axes titles.

# Save this boxplot to `p_box` so we can look at the underlying structure
p_box <- boxplot(visitors/1000 ~ year, # so now, 200,000 will be shown as 200 on the y-axis
        data = subdat,
        main = "National Monuments in the Intermountain Region", # plot title
        xlab = "Year", # x-axis title
        ylab = "Number of visitors, in 1000's") # y-axis title

Not bad for basic box plots. Now look at the data underlying the box plots. In ?boxplot, the section titled ‘Value’ explains what these numbers mean.

p_box
## $stats
##         [,1]    [,2]     [,3]    [,4]
## [1,]   3.131   2.919   4.3500   9.492
## [2,]  61.170  53.521  52.5660  53.165
## [3,] 112.611 103.570 103.8875  94.931
## [4,] 258.306 280.068 221.0830 222.723
## [5,] 550.657 505.158 470.9210 416.635
## 
## $n
## [1] 34 34 34 34
## 
## $conf
##          [,1]      [,2]      [,3]      [,4]
## [1,]  59.1935  42.18307  58.22483  48.98625
## [2,] 166.0285 164.95693 149.55017 140.87575
## 
## $out
##  [1] 789.131 848.348 622.320 830.253 525.831 578.554 827.247 793.601 497.506
## [10] 478.833 813.686 588.006
## 
## $group
##  [1] 1 1 2 2 3 3 3 4 4 4 4 4
## 
## $names
## [1] "2000" "2005" "2010" "2015"

From the boxplots and numbers, it looks like visitation was generally down in 2010 and 2015 compared to 2000 and 2005. The spread of visitor counts across these 34 National Monuments is huge, though.

The code below extracts the minimum and maximum visitor counts in subdat for the year 2000.

range(subdat$visitors[subdat$year==2000]) 
## [1]   3131 848348

For the Intermountain region in the year 2000, the number of visitors (N = 848348) to the most-visited National Monument was 270 times the number of visitors (N = 3131) to the least-visited National Monument!


Modifying Base R Plots

Type ?par in the console to see the many graphical parameters (arguments) that can be used to modify base R plots. Unless you plan to use base R functions to create complex plots, you can probably get by with remembering a small handful of useful graphical parameters, such as pch = to change scatterplot point types, col = to set color, and a variety of cex arguments to change font size and point size.

Let’s create a basic scatterplot and use some of these graphical parameters to gussy up the plot.


Example: Scatterplots

CHALLENGE: Create a scatterplot showing the relationship between park visitor count (y-axis) and gas price (x-axis) for Arches National Park (unit code is ARCH).

We want to know if annual visitation to Arches National Park is correlated with gas price (annual national average).

The data for this plot should be the 16 records of data for Arches National Park (one data point per year). Assign this subset of data to the name ‘arch_dat’ because we will be using it again for later plots.

Make these modifications to the default scatterplot:

  • Show visitor count as 1000’s of visitors (i.e., divide the visitor count by 1000) as we did for the box plots.

  • Use the graphical parameter pch to change the point type from the default open circle to a solid circle (see ?points to find the number corresponding to a filled circle. The examples at the end of ?points show how to use pch as a plot argument)

  • Increase point size to 1.5 instead of the default of 1

Remember, the Internet is a great source of help when coding in R!
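One possible solution (a sketch; the plot and axis titles are just one reasonable choice, and it assumes the viz_dat data frame from the earlier steps):

```r
# One possible solution -- assumes `viz_dat` from the earlier steps
arch_dat <- subset(viz_dat, unit_code == "ARCH") # 16 records, one per year

plot(visitors/1000 ~ gas_constant, data = arch_dat,
     pch = 16,  # filled circle (see ?points)
     cex = 1.5, # 1.5x the default point size
     main = "Arches National Park",
     xlab = "Gas price (constant 2015 dollars per gallon)",
     ylab = "Number of visitors, in 1000's")
```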

The final plot should show this relationship:

This plot shows that annual visitor counts increased as gas prices increased. That’s not the relationship we might have expected to see. But correlation is not causation, and the resolution of data influences the pattern we see. Maybe if we had used average June - July gas prices within 330 km of Arches National Park (instead of annual national averages) we might have seen a different relationship with visitor counts. Maybe, maybe not.

The scatterplot shows another interesting bit of information. What is that funny point that is marching to its own beat?! Gas price was low (~2.50 per gallon in 2015 dollars) and the visitor count was really high, ~ 1.4 million. A quick look at the data shows that this funny point was for the year 2015.

It would be interesting to see if 2015 was a high visitation year for other park units as well. We will save that question for ggplot.

For now, let’s see if we gain any additional insight from plots showing trends in visitor counts at Arches National Park and trends in national gas prices. In the plots below, the data point for year 2015 is shown in red. To do that, I added this argument to the plot function: col = ifelse(arch_dat$year == 2015, "red", "black")

The plot above shows that from 2004 to 2015, visitor numbers at Arches National Park increased every year. Visitation REALLY jumped in 2014 and 2015.

The plot above shows that gas prices have gone up and down between 2000 and 2015. There was a huge drop in gas price between 2008 and 2009 and again between 2014 and 2015.

These scatterplots give interesting insight on trends in national gas prices and trends in visitation at Arches National Park. The pattern we saw in our initial scatterplot–positive correlation between annual visitation and national average gas price–was almost certainly a spurious correlation. It’s important to look at data from many different directions and to not jump to conclusions based on a narrow exploration of the data.

Example: Line Graph

Line graphs are often used for showing changes in continuous data over time. Let’s make one last plot in base R before we move on to ggplot.

CHALLENGE: Re-create the plot of trends in visitation at Arches National Park, using a line graph instead of a scatterplot. If you’re feeling ambitious, make it a line graph with overplotted points.
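One possible solution (a sketch; it assumes the arch_dat subset created for the scatterplot challenge, and uses type = "o" to overplot points on the line):

```r
plot(visitors/1000 ~ year, data = arch_dat,
     type = "o", # line graph with overplotted points
     pch = 16,
     main = "Arches National Park",
     xlab = "Year",
     ylab = "Number of visitors, in 1000's")
```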

The final plot (with overplotted points) should show this relationship:

Building a Basic ggplot()

# This tab requires the `viz_dat` and `arch_dat` data frames created earlier in the module. If you don't have these data frames in the global environment, run the code below.

viz_dat <- readRDS("viz_dat.RDS") # this file is provided with the training materials. If this file is not in your current working directory, provide the file path in the function argument

arch_dat <- subset(viz_dat, unit_code == "ARCH") # subset of data that is for Arches National Park

If you search the Internet for information on plotting in R, you will quickly learn that ggplot is the most popular R package for plotting. It takes a little effort to learn how the pieces of a ggplot() object fit together. But once you get the hang of it, you will be able to create a large variety of attractive plots with just a few lines of R code.

Reference documents and cheatsheets can be very helpful while you are learning ggplot().

We will build our first ggplot() step-by-step to demonstrate how each component contributes to the final plot. Our first ggplot() will re-create the line graph of trends in visitation at Arches National Park.

Step 1. Create a “Master” Template for the Plot
  • identify the data (must be a data frame or tibble) [REQUIRED]
  • set default aesthetic mappings

An aesthetic mapping describes how variables in the data frame should be represented in the plot. Any column of data that will be used to create the plot must be mapped to an aesthetic. Examples of aesthetics include x, y, color, fill, point size, point shape, and line type.

From the ‘arch_dat’ data we will map ‘visitors’ (in 1000’s) to the y variable and ‘year’ to the x variable. This will set up default axes and axes titles for the plot.

Master template with default aesthetics
# NOTES:
# 1. The first argument provided to `ggplot()` is assumed to be the data frame, so it's not necessary to include the argument name `data =`.
# 2. We are assigning the plot to a variable `p` so we can build on this `ggplot()` master template in the next step.
# 3. The parentheses around the line of code are a shortcut for `print`. Without them, the plot would be assigned to `p` but not printed in the plots pane.

(p <- ggplot(data = arch_dat, aes(x = year, y = visitors/1000)))

summary(p) # summary of the information contained in the plot object
## data: region, unit_type, unit_name, unit_code, year, visitors,
##   gas_constant [16x7]
## mapping:  x = ~year, y = ~visitors/1000
## faceting: <ggproto object: Class FacetNull, Facet, gg>
##     compute_layout: function
##     draw_back: function
##     draw_front: function
##     draw_labels: function
##     draw_panels: function
##     finish_data: function
##     init_scales: function
##     map_data: function
##     params: list
##     setup_data: function
##     setup_params: function
##     shrink: TRUE
##     train_scales: function
##     vars: function
##     super:  <ggproto object: Class FacetNull, Facet, gg>
p$data # the plot object is self-contained. The underlying data frame is saved as a list element in the plot object. 
## # A tibble: 16 x 7
##    region        unit_type     unit_name   unit_code  year visitors gas_constant
##    <fct>         <chr>         <chr>       <chr>     <dbl>    <dbl>        <dbl>
##  1 Intermountain National Park Arches Nat~ ARCH       2015  1399247         2.45
##  2 Intermountain National Park Arches Nat~ ARCH       2014  1284767         3.4 
##  3 Intermountain National Park Arches Nat~ ARCH       2013  1082866         3.62
##  4 Intermountain National Park Arches Nat~ ARCH       2012  1070577         3.8 
##  5 Intermountain National Park Arches Nat~ ARCH       2011  1040758         3.75
##  6 Intermountain National Park Arches Nat~ ARCH       2010  1014405         3.02
##  7 Intermountain National Park Arches Nat~ ARCH       2009   996312         2.58
##  8 Intermountain National Park Arches Nat~ ARCH       2008   928795         3.61
##  9 Intermountain National Park Arches Nat~ ARCH       2007   860181         3.16
## 10 Intermountain National Park Arches Nat~ ARCH       2006   833049         3   
## 11 Intermountain National Park Arches Nat~ ARCH       2005   781670         2.74
## 12 Intermountain National Park Arches Nat~ ARCH       2004   733131         2.32
## 13 Intermountain National Park Arches Nat~ ARCH       2003   757781         2.01
## 14 Intermountain National Park Arches Nat~ ARCH       2002   769672         1.75
## 15 Intermountain National Park Arches Nat~ ARCH       2001   754026         1.91
## 16 Intermountain National Park Arches Nat~ ARCH       2000   786429         2.02
Step 2. Add Geometry Layer(s)

The geometry layer determines how the data will be represented in the plot. We can generally think of this as the plot type. For example, geom_histogram() creates a histogram and geom_line() creates a line graph.

If we had set default aesthetic mappings, the geometry layer will adopt those mappings unless additional (overriding) aesthetics are set in the layer itself. If we had not set default aesthetic mappings, they will need to be set within the layer. At a minimum we need to specify which data frame columns hold the x and (for most plot types) y aesthetics. Except for these two aesthetics, ggplot() will use its default values for any aesthetics not defined by the user.

Each geometry layer has its own set of accepted aesthetics. For example, the aesthetic linetype cannot be used in a geom_point() (point plot) layer. We cover plot aesthetics in more detail in the tab titled ‘Customizing a ggplot()’.

Add a line graph geometry layer
# NOTE: We combine `ggplot()` layers with `+`, not with `%>%`. With a `ggplot()` object we are not piping sequential results but rather layering pieces. In this example we are building our `ggplot()` object in a particular order, but we could actually reorder the pieces in any way (as long as the master template is set first). The only difference would be that if different layers have conflicting information, information in later layers overrides information in earlier layers.

p + geom_line()
Example of aesthetics set in the geometry layer

If we had NOT set the x and y aesthetics in the plot’s master template we could have set them within the geometry layer like this:

# Example of setting aesthetic mappings within the layer only
ggplot(data = arch_dat) +
  geom_line(aes(x = year, y = visitors/1000))
Add multiple geometry layers

We can include multiple geometry layers in a single ggplot() object. For example, to overlay a point plot on the line graph we would add a geom_point() layer. If we had set default aesthetic mappings, this additional geometry would adopt those mappings unless additional (overriding) aesthetics were set in the geom_point() layer itself.

p + geom_line() + geom_point()

If we had only set aesthetic mappings within the geom_line() layer, we would also need to provide geom_point() with its own aesthetic mappings – there are no default aesthetics for it to adopt.

The code below generates an error message: Error: geom_point requires the following missing aesthetics: x and y

# Example of setting aesthetic mappings within the layer only
ggplot(data = arch_dat) +
  geom_line(aes(x = year, y = visitors/1000)) +
  geom_point()
Geometry layers are shortcuts for the layer() function

Geometry layers can alternatively be specified with the generic plot layer() function. Geometry layers are basically the layer() function with some pre-defined (default) arguments. For example, geom_point() is the same as layer(geom = "point", stat = "identity", position = "identity"). See ?layer for more information.

p + layer(geom = "line", stat = "identity", position = "identity") # this creates the same line graph we had created using `geom_line()`
Step 3. Use scale_...() Functions to Fine-Tune Aesthetics

All aesthetic mappings can be fine-tuned with a scale_...() function. For example, the aesthetic mapping for x can be modified with scale_x_continuous() if the data are continuous, scale_x_discrete() if the data are categorical, and scale_x_date() if the data class is date/time. ggplot provides default scale arguments for all aesthetic mappings. If we don’t explicitly define a scale for an aesthetic, ggplot will use its default scale settings.

Modify the y-axis with scale_y_continuous()
# NOTES:
# 1. Our plot code is getting long, so we will separate out the components to improve readability
# 2. We will assign this plot object to a new variable, `p2` so we can just add to `p2` in the next step instead of re-typing all these lines
# 3. `ggplot()` also provides shortcuts for some scaling arguments. For example, if we just want to set y-axis limits we can add the layer `ylim(600, 1500)` instead of setting limits via `scale_y_continuous()`

# Check out `?scale_y_continuous` for ways to modify the y-axis
(p2 <- p + 
  geom_line() + 
  geom_point() +
  scale_y_continuous(
    name = "Number of visitors, in 1000's", # y-axis title
    limits = c(600, 1500), # minimum and maximum values for y-axis
    breaks = seq(600, 1500, by = 100)) # label the axis at 600, 700, 800... up to 1500
)
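The x-axis can be fine-tuned the same way. As a quick sketch (reusing the `p` object from earlier; the break spacing here is just for illustration):

```r
# Sketch: label the x-axis every 5 years instead of the default breaks
p +
  geom_line() +
  scale_x_continuous(breaks = seq(2000, 2015, by = 5))
```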
Step 4. Use labs() to Specify Plot Labels

This useful layer allows us to set various plot labels, such as the plot title and subtitle, x- and y-axis titles, and alternative text and plot tags. Some of these labels can be set in other ways rather than via labs(). For example, we had already set the y-axis title in the first argument of scale_y_continuous().

If we want to remove a default label from the plot, we can set the argument to NULL. For example, labs(x = NULL) removes the default x-axis title.
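As a quick sketch (building on the `p` object from earlier), removing the default x-axis title looks like:

```r
# Remove the default x-axis title from the plot
p + geom_line() + labs(x = NULL)
```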

Add plot labels
(p3 <- p2 + 
  labs(
    title = "Arches National Park Visitation, 2000 - 2015", # plot title
    subtitle = "Visitor counts have increased each year since 2004", # plot subtitle
    x = "Year") # x-axis title. We had already set the y-axis title with `scale_y_continuous()`.
 )
Step 5. Set a Complete Theme and Plot Element Themes

Use theme() to customize all non-data components of the plot. Conveniently, ggplot() provides complete themes that modify many element themes at once. The default complete theme is theme_grey. See ?theme for all the ways we can modify the non-data plot elements. This help page also provides a hyperlink to a help page for complete themes (I’m sure there is a way to get there more directly, but I couldn’t figure it out).

Explore ggplot() complete themes
p3 + 
  theme_minimal() # try out different themes! theme_classic(), theme_void(), theme_linedraw()... 
Create a FiveThirtyEight-inspired plot

For more fun, install and load the ggthemes package, which allows us to apply plot styles inspired by the Economist and the popular FiveThirtyEight website, among other choices.

p3 + 
  ggthemes::theme_fivethirtyeight()
Modify theme elements

In addition to setting complete themes, we can individually modify element themes. If the element themes layer is set AFTER the complete themes layer, the element themes will override any conflicting default arguments in the complete theme. Some of these element themes can alternatively be set in a scale_...() function. For example, to move the y-axis to the right side of the plot, use scale_y_continuous(position = "right") or the corresponding theme() arguments.

It’s a bit difficult to understand the help page for ?theme. I usually resort to an Internet search to figure out how to set an element theme.

p3 + 
  ggthemes::theme_fivethirtyeight() +
  theme(
    plot.subtitle = element_text(color = "red", size = 14, hjust=0.5)) # change plot subtitle font color and size, and center the subtitle
Step 6. Split a Plot into Many Subplots

We can use facet_grid() or facet_wrap() to split a ggplot() object into multiple subplots, each representing a different group level. This is a nice way to simplify data-heavy plots with a single line of code.

With facet_grid() we can specify different grouping variables for a row of plots versus a column of plots. For example, we can say that each row of subplots should represent a different region and each column of subplots should represent a different unit type. With facet_wrap(), subplots are “wrapped” in order from left-to-right and top-to-bottom, just like text is wrapped on a page.
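We won’t run facet_grid() below, but as a sketch of its syntax (assuming the full viz_dat dataset with its region and unit_type columns):

```r
# Hypothetical sketch: one row of subplots per region, one column per unit type
ggplot(viz_dat, aes(x = year, y = visitors/1000)) +
  geom_line() +
  facet_grid(rows = vars(region), cols = vars(unit_type))
```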

Create a faceted plot
# NOTES:
# 1. For this plot we need to use the `viz_dat` dataset so we can facet by park unit name
# 2. In the `facet_wrap` arguments, `ncol = 1` sets a single column so plots are stacked on top of each other and `scales = "free_y"` allows the y-axis range to differ across the subplots (this is useful if the range of y-axis values is very different across subplots and we are more interested in comparing trends than in comparing absolute numbers among the subplots)

subdat <- subset(viz_dat, unit_code %in% c("ARCH", "GRCA", "YOSE")) # plot data for Arches National Park (ARCH), Grand Canyon National Park (GRCA), and Yosemite National Park (YOSE)

ggplot(subdat, aes(x = year, y = visitors/1000)) + 
  geom_line() + 
  geom_point() +
  labs(
    title = "National Park Visitation, 2000 - 2015",
    x = "Year", 
    y = "Number of visitors, in 1000's") +
  theme_bw() +
  theme(plot.title = element_text(hjust=0.5)) + # center the plot title
  facet_wrap(vars(unit_name), ncol = 1, scales = "free_y")

We were wondering if 2015 was a very high visitation year for parks other than Arches National Park. It looks like it was for Grand Canyon and Yosemite National Parks too!

Summary of Useful ggplot() Components

These are not the ONLY ggplot() components, but they are the ones most useful to know (IMHO)

  • Data - a data frame (or tibble). Specified as the first ggplot() argument. Any aesthetics set at this level are part of the “master” template for the plot.

  • Aesthetic mapping - the variables represented in the plot (and how), e.g., x, y, color, fill, size, shape, line type. These aesthetics, except x & y, can also be set to a constant value by specifying them outside of aes(). Aesthetics can be defined in the master template or in a geometry layer.

  • Geometry layer - the plot/mark layers, e.g., geom_point, geom_line, geom_boxplot. Can specify multiple such layers in a plot.

  • Scales - used to fine-tune aesthetics. Takes the form ‘scale_xxx_yyy’, where ‘xxx’ refers to the aesthetic and ‘yyy’ how the aesthetic values will be specified, e.g., discrete, continuous, manual. The associated legend for the aesthetic can also be modified here.

  • Labs - sets the plot labels, including plot title, subtitle, caption, and axis labels. Importantly, this is also where you would set alternative text and plot tags, for Section 508 Compliance.

  • Themes - complete themes (e.g., theme_bw, theme_classic) can be used to control all non-data display. Themes can also be used to more specifically modify aspects of the panel background, legend elements, label text size and color, etc.

  • Facets - subset the data by grouping variables and generate a separate subplot for each grouping variable, putting the graphs all on the same page.
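Putting these components together, a generic ggplot() call looks something like this sketch (my_data, xvar, yvar, and group_var are placeholder names):

```r
ggplot(data = my_data, aes(x = xvar, y = yvar)) + # data + master aesthetic mappings
  geom_line() +                                   # geometry layer(s)
  scale_y_continuous(name = "Y-axis title") +     # scale: fine-tune an aesthetic
  labs(title = "Plot title", x = "X-axis title") + # plot labels
  theme_bw() +                                    # complete theme
  facet_wrap(vars(group_var))                     # subplots by group level
```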

Visualizing Spatial Data

GIS in R

If you tried to do GIS and make maps in R even a few years ago, you probably encountered the same frustrations I did. There were a ton of packages out there, each with its own file format and coding conventions, and each package rarely had all the features I needed. It was not easy to navigate, and I often found myself just exporting my work out of R and doing the rest in ArcGIS… Enter the sf and tmap packages, which are the latest and greatest R packages devoted to GIS and map making! Most of the frustrations I had with earlier packages have been resolved by these two packages.

The sf package is the workhorse for anything you need to do with spatial vector data. Spatial objects in sf are called simple features, which follow a set of GIS standards that are becoming the universal data model for vector-based spatial data in R and that most GIS-based packages in R now employ. Simple features are also now the standard for open source GIS in general. That means if you’re trying to troubleshoot something with simple features, you’ll need to specify that it’s for R, rather than PostGIS or some other implementation. The sf package is also superseding the sp and rgdal packages, which used to provide the more common spatial data model and file handling in R and open source GIS. The more I use sf, the more impressed I am with how intuitive it is to use, and how well documented the functions are. For vector data, I have yet to encounter a task that sf couldn’t do.

The main application of the tmap package is making maps, and it was designed with a grammar of graphics philosophy similar to ggplot2. In practice, this means that maps are built as a collection of many small functions/layers that get added together, and order matters. There are also tmap-enabled functions that you can use in ggplot2 plots, but you can do a lot more in tmap. I also prefer the look of tmap’s built-in compass, legends, etc. over the ones available in ggspatial, which is an add-on package for plotting spatial data in ggplot2.

The raster package is also excellent for analyzing and processing raster data. For large jobs, I find it easier to work with than ESRI tools, and it tends to run a lot faster than ESRI’s built-in tools (I haven’t compared it with Python).

Finally, the leaflet package in R allows you to create interactive maps as html widgets. These are often included in R shiny apps, but they can also be used in R Markdown with HTML output (for more on that, attend next Wednesday’s session on R Markdown). An internet connection is required for the maps to be interactive. Without an internet connection the map will show the default extent defined by the code.

The leaflet package is relatively easy to use for basic mapping. For more advanced features or to customize it further, you often end up having to code in JavaScript, which is the language leaflet was originally programmed in. There’s also a lot more online help available for the JavaScript version of the leaflet library than the R version. If you’re really stumped about something, you may find your answer with the JavaScript help.


Using sf and tmap

My two favorite features of sf are 1) the attribute table of a simple feature (sf’s equivalent of a shapefile) is a dataframe in R, and 2) package functions are pipeable with tidyverse functions. That means, if you want to delete points in your layer, you can use dplyr’s filter() function to filter out those points. The sf package will update the geometry of the layer to remove the points that were filtered out.

To demonstrate the use of sf and tmap, I’m going to generate a simple GRTS sample using spsurvey, which now connects to sf instead of sp and rgdal. Then we’ll filter out points that don’t fall in a forest habitat to demonstrate how to work with sf data in R. Finally we’ll plot the results using tmap.

I wasn’t able to figure out how to post a shapefile that you could easily download in R with a url. I emailed them to you as data.zip the previous week. I also posted all of the files to our Scientists Team, which can be downloaded using the following link: https://doimspp.sharepoint.com/:f:/s/ScientistsTraining2022/EiXOTYV1l4RCk1sMr5yXhZUB1ZFEaAlAN4elvsYbBfYHRg?e=ktVy5n. To follow along, you’ll need to download (and unzip if you’re using the email attachment) the shapefiles and save them to your working directory.

Step 1. Load and prep shapefiles using sf

The code chunk below has a lot going on, but bear with me. This is the easiest way I’ve found to customize the layers in tmap plots. The basic steps in the code chunk below are:
  1. Load libraries
  2. Set a seed for GRTS sample. As long as you save the seed and layers used to generate the GRTS sample, you should be able to recreate the same GRTS sample.
  3. Read in park boundary and vegetation map layers.
  4. View quick plot of layer in base R, just to make sure it looks right.
  5. Clip the veg layer to the park boundary using intersection, so map looks better.
  6. Look at the attribute tables for each layer.
  7. Add 2 columns to the veg layer to simplify the habitats (simp_veg) and set their colors (fill_col)

Once those steps are completed, we’re ready to generate a GRTS sample and start making a map. Note that I’m using 6-digit hex colors (e.g., “#ffffff” is white) to define the fill color for each habitat type. To pick your own or see what these colors look like, check out htmlcolorcodes.com

#install.packages(c("sf", "spsurvey"))
library(dplyr) # for filter, select, mutate and %>%  
library(sf)
library(tmap)
library(spsurvey)

# Generate a random number for the seed
sample(1:100000, 1) #62051
set.seed(62051)

# Read in shapefiles from teams data folder
sara_bound1 <- st_read("./data/SARA_boundary_4269.shp")
sara_veg1 <- st_read("./data/SARA_Veg.shp")

# Check that the projections match; fix the one that doesn't match
st_crs(sara_bound1) == st_crs(sara_veg1) # FALSE- projections don't match.
# sara_bound1 needs to be reprojected to UTM Zone 18N NAD83. 
sara_bound <- st_transform(sara_bound1, crs = 26918)
st_crs(sara_bound) == st_crs(sara_veg1) # TRUE

# Quick plot
plot(sara_bound[1])
plot(sara_veg1[1]) # bigger extent than boundary

# Intersect boundary and veg so they have the same extent
sara_veg <- st_intersection(sara_veg1, sara_bound)
plot(sara_veg[1])
# View attribute table of layers
head(sara_bound) # 1 feature with 95 fields

str(sara_veg)
head(sara_veg)
names(sara_veg)
table(sara_veg$ANDERSONII)

# Simplify vegetation types for easier plotting
dev <- c("1. Urban or Built-up Land", "11. Residential", 
         "12. Commercial and Services", "13. Industrial",
         "14. Transportation, Communications, and Utilities", 
         "17. Other Urban or Built-up Land")
crop <- c("21. Cropland and Pasture", 
          "22. Orchards, Groves, Vineyards, and Nurseries", 
          "31. Herbaceous Rangeland")
shrubland <- c("32. Shrub and Brush Rangeland")
forest_up <- c("41. Deciduous Forest Land", "42. Evergreen Forest Land", 
               "43. Mixed Forest Land")
forest_wet <- c("61. Forested Wetland")
open_wet <- c("62. Nonforested wetland", "62. Nonforested Wetland")
water <- c("5. Water", "51. Streams and Canals", "53. Reservoirs")
unk <- "Unclassified"

# Create 2 fields in the veg attribute table: simp_veg, and fills
sara_veg <- sara_veg %>% 
  mutate(simp_veg = case_when(ANDERSONII %in% dev ~ "Developed",
                              ANDERSONII %in% crop ~ "Open field",
                              ANDERSONII %in% shrubland ~ "Shrublands",
                              ANDERSONII %in% forest_up ~ "Forest",
                              ANDERSONII %in% forest_wet ~ "Forested wetland",
                              ANDERSONII %in% open_wet ~ "Open wetland",
                              ANDERSONII %in% water ~ "Open water",
                              ANDERSONII %in% unk ~ "Unclassified",
                              TRUE ~ "Unknown"),
         fill_col = case_when(simp_veg == "Developed" ~ "#D8D8D8",
                              simp_veg == "Open field" ~ "#f5f0b0",
                              simp_veg == "Shrublands" ~ "#F29839",
                              simp_veg == "Powerline ROW" ~ "#F9421D",
                              simp_veg == "Forest" ~ "#55785e",
                              simp_veg == "Forested wetland" ~ "#9577a6",
                              simp_veg == "Open wetland" ~ "#c497d4",
                              simp_veg == "Open water" ~ "#AFD0F2",
                              simp_veg == "Unclassified" ~ "#ffffff"))


Sidebar on simple features and shapefiles

Before moving to the next step, I wanted to show how easy it is to create simple features from dataframes that contain X/Y coordinates. We’ll read in a fake dataset in the GitHub repo for this training, and call it wetdat. The dataset contains fake species composition data for wetland monitoring sites in ACAD and includes X and Y coordinates. We’ll use dplyr functions to calculate the number of invasive and protected species in each site by year combination. Then we’ll make it a simple feature, and save it as a shapefile. Note that there are multiple formats you can save simple features as. I show the shapefile version, in case you find yourself going between R and ArcGIS.

library(dplyr)
# Import data
wetdat <- read.csv(
  "https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Advanced/main/data/ACAD_wetland_data_clean.csv")

# Summarize so that each site x year combination has 1 row
wetsum <- wetdat %>% group_by(Site_Name, Year, X_Coord, Y_Coord) %>% 
  summarize(num_inv = sum(Invasive), num_prot = sum(Protected), 
            .groups = 'drop')
# Check first 6 rows of output
head(wetsum)

# Convert dataframe to sf
wetsum_sf <- st_as_sf(wetsum, coords = c("X_Coord", "Y_Coord"), crs = 26919) 
  # ACAD is UTM Zone 19N NAD83, hence the difference from SARA, which is 18N.

# Write sf to disk as shapefile
st_write(wetsum_sf, "ACAD_wetland_data.shp")


Step 2. Generate GRTS Sample

The spsurvey package has been updated to point to sf instead of rgdal. It’s a code-breaking change if you have old R scripts that generated GRTS samples. However, the process is even easier now, as you can see below. Here we’re just going to use the simplest of GRTS designs. The output from grts() has multiple slots. The one we want is sites_base, and you can see how we get that slot in the code below.

Note: While I followed the approach documented in the spsurvey vignette to generate reproducible GRTS samples, it does not appear to be working as I’m testing it. Despite running set.seed(62051) after loading spsurvey and then running the code chunk below, each sample appears to be different.

sara_grts <- grts(sara_bound, n_base = 100) # generate 100 points within SARA boundary
sara_grts$sites_base$priority <- as.numeric(row.names(sara_grts$sites_base)) # add priority number (same as row.name)


Step 3. Spatial join and filter

The first step here is to spatially join the sara_grts layer to the sara_veg layer. We only care about sara_veg’s simp_veg field, so I used dplyr’s select() function. After joining, you should see a simp_veg field added to the end of the grts layer (it’s actually to the left of geometry, which is the last column). In the next step, we then filter the points in the newly created grts_veg layer to only include points that fall in forest habitat.

# Spatial join
grts_veg <- st_join(sara_grts$sites_base, sara_veg %>% select(simp_veg))
names(grts_veg)

# Create list of forest habitats
sort(unique(sara_veg$simp_veg))
forest_veg <- c("Forest", "Forested wetland") # must match the simp_veg values created earlier exactly (case-sensitive)

# Filter out non-forest points
grts_forest <- grts_veg %>% filter(simp_veg %in% forest_veg)
nrow(grts_veg) # 100 points
nrow(grts_forest) # fewer points


Step 4. Create map

Now it’s time to plot. The code below may seem a bit much, but we’ll break it down piece by piece. The great thing about building maps this way is that you’re essentially building a template in code that you can steal from and adapt for future maps. You can even turn these into a function, making future maps even easier. That’s beyond what we can cover here, but it’s another great benefit of making maps in R instead of ArcGIS.

The code below is broken into the various pieces that make up the map. The way tmap works is that first, you have to add the layer via tm_shape(), and then you specify how you want that layer to look, like tm_fill(), or tm_border(). Each piece has its own legend as well. This is similar to how you start each ggplot2 graph defining the data with ggplot(data, aes(x = xvar, y = yvar)), and then start adding geom_point(), or whatever else to define how the data are plotted. The difference with tmap is that every layer you want in the map has to be coded this way. Finally tm_layout() is similar to theme() in ggplot2, and is where you can customize the map layout.

A couple of notes about the code below:
  • The first line that defines for_legend makes a list of the habitat types in simp_veg and their associated colors. That saves us from typing them all again (and potentially misspelling them).
  • The first tm_shape() in the map sets the projection and the bounding box. If you don’t set the bounding box, the map will show the largest extent of your layers. So if you have a roads layer at the state-level, your map will be zoomed at that extent, instead of the park boundary.
  • Note the difference between tm_fill(), which fills in colors, and tm_borders(), which only plots outlines. If you want both borders and fills, use tm_polygons() instead.
  • The z value in the tm_add_legend() allows you to set the order that each legend appears in the map. Otherwise, they’ll appear in the order you specify the layers.
  • The code for tm_symbols() allows you to change the symbol format in a similar way to ggplot2. We then added tm_text() to label each point using the numbers in the priority column of the grts_forest attribute table. The xmod and ymod arguments allow you to offset labels from the points horizontally or vertically. In this case, a negative xmod moves the label to the left, and a negative ymod moves the label below the point. The default location for labels is directly on top of the point.
  • The tm_layout() code is where you can change the default settings of the figure, including font size, placement of the legend, and margins. The title name and position are also set in the tm_layout().
  • The tmap_save() function allows you to write the map to disk and to specify the height and width of the map. I prefer to write maps to disk to see what the size looks like before changing any of the layout and font size settings, because the figure on your disk will look different (and often better) than the plot view in your RStudio session.
# Creating list of simp_veg types and their fill colors for easier legend
for_legend <- unique(data.frame(simp_veg = sara_veg$simp_veg, fill_col = sara_veg$fill_col)) 

sara_map <- 
  # Vegetation map
  tm_shape(sara_veg, projection = 26918, bbox = sara_bound) +
  tm_fill("fill_col") +
  tm_add_legend(type = 'fill', labels = for_legend$simp_veg, 
                col = for_legend$fill_col, z = 3) +

  # Park boundary
  tm_shape(sara_bound) +
  tm_borders('black', lwd = 2) +
  tm_add_legend(type = 'line', labels = "Park Boundary", col = "black",
                z = 2)+
    
  # GRTS points  
  tm_shape(grts_forest) +
  tm_symbols(shape = 21, col = "#EAFF16", border.lwd = 0.5, size = 0.3) + 
  tm_text("priority", size = 0.9, xmod = -0.4, ymod = -0.4) +
  tm_add_legend(type = 'symbol', labels = "GRTS points", shape = 21, 
                col = "#EAFF16", border.lwd = 1, size = 0.5, 
                z = 1) +
  
  # Other map features
  tm_compass(size = 2, type = 'arrow', 
             text.size = 1, position = c('left', 'bottom')) +
  tm_scale_bar(text.size = 1.25, position = c("center", "bottom")) + 
  tm_layout(inner.margins = c(0.2, 0.02, 0.02, 0.02), # make room for legend
            outer.margins = 0,
            legend.text.size = 1.25,
            legend.just = 'right',
            legend.position = c("right", "bottom"),
            title = "Saratoga NHP GRTS points",
            title.position = c("center", "top"))

tmap_save(sara_map, "SARA_GRTS.png", height = 10.5, width = 8, 
          units = 'in', dpi = 600, outer.margins = 0)


Here’s what sara_map looks like in your plot viewer:

sara_map




Here’s what the map looks like after it’s written to disk and the dimensions are set using tmap_save():

The last cool thing to show with tmap is that you can include interactive HTML widgets similar to what leaflet can do (coming next). With the interactive mode, you can change baselayers, turn your layers off/on, and zoom to different extents. The legend in the interactive mode is limited to showing fills only, but it’s still pretty cool.

tmap_mode("view") # turn interactive mode on
sara_map
tmap_mode("plot") # return to default plot view


Using leaflet and parktiles

Now I’ll show you how to create a simple leaflet map including NPS parktiles as a basemap. By default, layers and basemaps that feed into leaflet need to be specified in latitude and longitude using the WGS84 projection (EPSG:4326). This projection may introduce some inaccuracies when projecting UTM to WGS84, such as plots at the edge of park boundaries appearing outside of the boundary. It has to do with the geographic transformation used to go from UTM to WGS. Recent improvements, including the Proj4Leaflet plugin, allow you to change coordinate systems, which may improve this. For more on that, see the Projections section of RStudio’s leaflet page. For our purposes here, we’ll use the default WGS84 and reproject our data to match.

We’re going to use some of the SARA layers we used with tmap, but we have to reproject the layers to WGS84. First we’ll define the four different park tiles that are available. These are the same tiles you see if you view the map on a given park’s website.

# Load library
library(leaflet)

# Load park tiles
NPSbasic = "https://atlas-stg.geoplatform.gov/styles/v1/atlas-user/ck58pyquo009v01p99xebegr9/tiles/256/{z}/{x}/{y}@2x?access_token=pk.eyJ1IjoiYXRsYXMtdXNlciIsImEiOiJjazFmdGx2bjQwMDAwMG5wZmYwbmJwbmE2In0.lWXK2UexpXuyVitesLdwUg"
  
NPSimagery = "https://atlas-stg.geoplatform.gov/styles/v1/atlas-user/ck72fwp2642dv07o7tbqinvz4/tiles/256/{z}/{x}/{y}@2x?access_token=pk.eyJ1IjoiYXRsYXMtdXNlciIsImEiOiJjazFmdGx2bjQwMDAwMG5wZmYwbmJwbmE2In0.lWXK2UexpXuyVitesLdwUg"
  
NPSslate = "https://atlas-stg.geoplatform.gov/styles/v1/atlas-user/ck5cpvc2e0avf01p9zaw4co8o/tiles/256/{z}/{x}/{y}@2x?access_token=pk.eyJ1IjoiYXRsYXMtdXNlciIsImEiOiJjazFmdGx2bjQwMDAwMG5wZmYwbmJwbmE2In0.lWXK2UexpXuyVitesLdwUg"
  
NPSlight = "https://atlas-stg.geoplatform.gov/styles/v1/atlas-user/ck5cpia2u0auf01p9vbugvcpv/tiles/256/{z}/{x}/{y}@2x?access_token=pk.eyJ1IjoiYXRsYXMtdXNlciIsImEiOiJjazFmdGx2bjQwMDAwMG5wZmYwbmJwbmE2In0.lWXK2UexpXuyVitesLdwUg"

# Reproject SARA layers
sara_bnd_wgs <- st_transform(sara_bound, crs = 4326)
grts_forest_wgs <- st_transform(grts_forest, crs = 4326)

# Calculate centroid for leaflet map view
sara_cent <- st_centroid(sara_bnd_wgs)$geometry
str(sara_cent) # -73.63459, 42.99543


Now we add the pieces together. The leaflet package operates similarly to tmap and ggplot2, in that you build the map by adding consecutive layers and connecting the code with pipes.

A couple of notes on the code below:
  • setView() allows you to set the center (in WGS84) and the zoom level of the map. A zoom of 0 shows the entire world; the higher the zoom level, the closer the view. You can also set the min/max zoom a user can view in a map, along with a maximum bounding box that a user can pan or zoom within.
  • addTiles() is how you add individual tiles. The group = "text" argument names the tiles, which are then shown in the map’s layer control via addLayersControl().
  • addCircleMarkers() is how the GRTS points are added to the map. I also had to specify the lat/lng coordinates for each point, and I used sf’s st_coordinates() to extract them.
  • The radius of the point is how you specify the size of the markers.
  • The stroke refers to the outline. I turned it off in this case.
  • I also added a scale bar and set its width and units using addScaleBar() and scaleBarOptions().
  • There are all kinds of bells and whistles you can add to a leaflet plot, including popups on hover or click, custom colors and symbols based on the data, etc. For more examples, check out Leaflet for R.
sara_map_lf <-
  leaflet() %>% 
  setView(lng = -73.63, lat = 42.99, zoom = 13) %>% 
  # parktiles
  addTiles(group = "Map",
           urlTemplate = NPSbasic) %>%
  addTiles(group = "Imagery",
           urlTemplate = NPSimagery) %>%
  addTiles(group = "Light",
           urlTemplate = NPSlight) %>%
  addTiles(group = "Slate",
           urlTemplate = NPSslate) %>% 
  addLayersControl(map = .,
    baseGroups = c("Map", "Imagery", "Light", "Slate"),
    options = layersControlOptions(collapsed = T)) %>% 
  # points
  addCircleMarkers(data = grts_forest_wgs,
                   lng = st_coordinates(grts_forest_wgs)[,1],
                   lat = st_coordinates(grts_forest_wgs)[,2],
                   label = grts_forest_wgs$priority,
                   labelOptions = labelOptions(noHide=T, textOnly = TRUE, 
                      direction = 'bottom', textsize = "12px"),
                   fillColor = "#EAFF16",
                   radius = 4,
                   stroke = FALSE, # turn off outline
                   fillOpacity = 1) %>% 
  # scale bar and settings
  addScaleBar(position = 'bottomright',
              options = scaleBarOptions(maxWidth = 10, metric = TRUE))
sara_map_lf
Note that I can’t figure out why I’m not able to make this map render on the website, but I’ll demonstrate it in RStudio.

Resources

Using ggplot2


Visualizing Spatial Data
  • Geocomputation in R is an excellent online book on the sf and tmap packages.
  • RStudio’s leaflet for R page shows the basics of how to use leaflet.
  • R-spatial mapping in ggplot: The r-spatial.org site is a blog that covers lots of interesting topics. The post linked here helped me figure out how to map shapefiles in ggplot.



Day 4: Best Practices & Wrap-up

Good Coding Practices

A note about best practices

As you continue to work in R, your code projects will quickly become bigger and more complex than the examples we’ve seen this week. The larger the project, the harder it is to keep your thoughts organized when you are working on it - especially if you have to pick it back up after some time away. Fortunately, there are some things you can do to keep your code tidy and understandable.

Organize your files

As we learned earlier in this class, you should always use RStudio projects to organize your code projects. If you have a lot of files, it’s often helpful to include sub-folders within your project folder for things like raw data and final outputs. Avoid using setwd() in your code, as this will likely cause your code to break when it’s run from a different computer or user account. Instead, make sure any file paths in your code are relative to the project folder. That means that if you have a folder called data inside your project folder, you would just refer to it using the path ./data (the ‘.’ is shorthand for your current working directory). A file inside the data folder called my_raw_data.csv would have a relative path of ./data/my_raw_data.csv.
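A quick illustration: file.path() assembles such relative paths without hand-typing separators (the folder and file names here are just the examples from above).

```r
# "." is shorthand for the project root (your current working directory)
csv_path <- file.path(".", "data", "my_raw_data.csv")
csv_path
## [1] "./data/my_raw_data.csv"
```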

example of project folder

Comment your code

Comments in your code are a gift to your future self and others. You don’t need to explain every single line of code, but if it’s not obvious what your code is doing, it’s a good idea to leave a note. If it took you a while to figure out how to write a chunk of code, or if it was hard to get it to work, that’s a good sign that you should take the time to write some comments.

For longer scripts, you can use comments to break your code up into sections (e.g. Setup, Analysis, Plots). It’s often helpful to make an outline in comments before you even start coding.

Style your code consistently

When writing for an audience, it’s helpful to follow a style guide so that inconsistencies and poor grammar don’t distract readers from what you’re trying to say. The National Park Service even has its own style guide. The same goes for code. If the “grammar” of your code is tidy and organized, it’ll be easier for you and others to focus on what the code actually does.

There are multiple R style guides out there, but the Tidyverse style guide is a good place to start. Using the styler package, you can even style your code automatically!

If you find yourself getting bogged down in the finer details of coding style, take a step back and just focus on being generally tidy and consistent. Especially with tools like styler, conforming to a style guide shouldn’t have to feel like a burden.

Be efficient*
# Retrieve the latest water quality data from the full dataset
pH_2020 <- dplyr::filter(pH_Data_All, WaterYear == 2020)
DissolvedOxygen_2020 <- dplyr::filter(DO_Data_All, WaterYear == 2020)
SpecificConductance_2020 <- dplyr::filter(SpCond_Data_All, WaterYear == 2020)

The code above should work fine, assuming that the datasets pH_Data_All, DO_Data_All, and SpCond_Data_All are defined somewhere. But let’s imagine what this code would look like if we translated it into an email. It would go something like this:

Dear Dr. Hydrologist,

I would like a copy of your pH data from 2020. Please send me the pH dataset from 2020. I need your dissolved oxygen data from 2020. Please send me the dissolved oxygen dataset from 2020. I need your specific conductance data from 2020. Please send me the specific conductance dataset from 2020.

An aspiring R programmer

Well, we certainly got the information across. But our collaborator might look at us funny the next time we see them. We only wanted data from one year, but we wrote it six times! Let’s make some minor edits.

Dear Dr. Hydrologist,

I need a copy of your most current pH, dissolved oxygen, and specific conductance data. Please send me the 2020 datasets for pH, dissolved oxygen, and specific conductance.

An aspiring R programmer

That’s a big improvement. We could keep editing but let’s call that good enough for now. Translating back into R code, we get something like this:

current_water_year <- 2020

pH_Current <- dplyr::filter(pH_Data_All, WaterYear == current_water_year)
DissolvedOxygen_Current <- dplyr::filter(DO_Data_All, WaterYear == current_water_year)
SpecificConductance_Current <- dplyr::filter(SpCond_Data_All, WaterYear == current_water_year)

Notice how the water year is only specified in one place at the top of the script, instead of being hardcoded in multiple places throughout the code. Not only does this reduce the chance of typos, it makes it much easier to update the water year next year. This doesn’t seem like a big deal in an example that’s only three lines of code, but imagine having to search through 100 or 1000 lines of code for every occurrence of the current water year or field season. Hardcoding values (e.g. field season, park code) throughout your code, especially when they appear more than once, will cause headaches down the line. Instead, assign values to variables at the top of your code so that you can keep track of them in a single place.

This is an example of the DRY (Don’t Repeat Yourself) Principle. If you find yourself hardcoding duplicate information (like the water year above) or repeatedly copy-pasting the same piece of code, it’s probably worth editing your code to make it less redundant.

*The caveat: Don’t optimize your code too early. As with most types of writing, it’s usually best to produce a working draft and then edit it until you’re satisfied with the result. Trying to get it perfect on the first try usually just leads to analysis paralysis! But the more you practice writing DRY code, the more naturally it will come.

There are a number of ways you can structure your code to make it less redundant and save yourself some work. We’ll talk more about this in Control Structures.

Vectorization

On the first day of this class, we introduced vectors. Here are some examples:

# Numeric vectors
numbers <- 1:10
other_numbers <- 6:15

# Character vectors
code_lang <- c("R", "C++", "Java", "Python", "C")
human_lang <- c("English", "Spanish", "Chinese", "Russian")

What if we want to multiply numbers by two? The brute-force method looks something like this:

new_numbers <- c()  # Create a new NULL vector

new_numbers[1] <- 2 * numbers[1]
new_numbers[2] <- 2 * numbers[2]
new_numbers[3] <- 2 * numbers[3]
new_numbers[4] <- 2 * numbers[4]
new_numbers[5] <- 2 * numbers[5]
new_numbers[6] <- 2 * numbers[6]
new_numbers[7] <- 2 * numbers[7]
new_numbers[8] <- 2 * numbers[8]
new_numbers[9] <- 2 * numbers[9]
new_numbers[10] <- 2 * numbers[10]

new_numbers
##  [1]  2  4  6  8 10 12 14 16 18 20

Obviously we’re making this harder than it needs to be. Many functions and operations in R are vectorized, meaning that they will be applied in parallel to each element of a vector. In fact, you’ve already seen examples of vectorization earlier in this class. Because data frame columns are just vectors, much of the data wrangling you’ve done so far takes advantage of vectorization. Here, we’ll talk about vectorization in a little more detail, but a lot of this may look familiar!

As you probably know, all we need to do to multiply each element in numbers by two is:

new_numbers <- numbers * 2
new_numbers
##  [1]  2  4  6  8 10 12 14 16 18 20

Much less tedious. In general, if you feel like you’re doing something by brute force in R, that’s a good indicator that there’s a better way to do it.

As you might expect, division, addition, and subtraction are also vectorized. In fact, this even works with two or more vectors.

numbers
##  [1]  1  2  3  4  5  6  7  8  9 10
other_numbers
##  [1]  6  7  8  9 10 11 12 13 14 15
other_numbers - numbers
##  [1] 5 5 5 5 5 5 5 5 5 5
other_numbers * numbers
##  [1]   6  14  24  36  50  66  84 104 126 150

So far we’ve only looked at operations between vectors and single values or two vectors of the same length. But what happens if we try to add two vectors of different lengths? Let’s try it out.

ones <- rep(1, 6)  # Vector of 1's, length 6
ones
## [1] 1 1 1 1 1 1
add_this <- c(1, 2)
this_one_breaks <- c(1, 2, 3, 4)

ones + add_this  # This works fine
## [1] 2 3 2 3 2 3
ones + this_one_breaks  # This generates a warning
## Warning in ones + this_one_breaks: longer object length is not a multiple of
## shorter object length
## [1] 2 3 4 5 2 3

What happened here? When we added the vector [1, 2] to the vector [1, 1, 1, 1, 1, 1], we got [2, 3, 2, 3, 2, 3]. When we added [1, 2, 3, 4] to [1, 1, 1, 1, 1, 1], we got a result, but we also got a warning along with it. The text of the warning actually gives a clue as to what R is doing. It says “longer object length is not a multiple of shorter object length” because when R tries to operate on vectors of different lengths, it recycles the shorter vector until it matches the length of the longer vector. ones is a vector of length 6. Since add_this is length 2 and 2 is a multiple of 6, R adds [1, 1, 1, 1, 1, 1] + [1, 2, 1, 2, 1, 2], repeating add_this exactly three times. However, this_one_breaks is a vector of length 4, which is not a multiple of 6. Therefore, we get a warning because R cannot fit this_one_breaks into a vector of length 6 without truncating it. Operating on vectors of different lengths is often useful. However, it’s easy to make mistakes if you forget that vectors in R work this way! Keep this in mind when you are debugging.

As you may remember from working with data frames, lots of operations in R are vectorized, not just basic arithmetic. This includes comparison operators (<, >, ==, <=, >=, !=), logical operators (&, |, !), and many functions (is.na(), gsub(), grep(), and too many others to list).
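For instance, a comparison and is.na() apply element-wise across a whole vector, just like arithmetic does:

```r
temps <- c(12.5, NA, 31.0, 8.2)
temps > 10   # element-wise comparison; NA stays NA
## [1]  TRUE    NA  TRUE FALSE
is.na(temps) # element-wise check for missing values
## [1] FALSE  TRUE FALSE FALSE
```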

Here’s another example. It’s fairly common to end up with data that was entered as text and has extra whitespace:

messy_data <- tibble::tibble(plot_id = c("1 ", " 2 ", " ", " 4", "5."),
                             year_monitored = c(2019, 2019, 2019, 2020, 2020))
messy_data
## # A tibble: 5 x 2
##   plot_id year_monitored
##   <chr>            <dbl>
## 1 "1 "              2019
## 2 " 2 "             2019
## 3 " "               2019
## 4 " 4"              2020
## 5 "5."              2020

The dplyr way to fix this might look something like this:

clean_data <- messy_data %>% dplyr::mutate(plot_id = trimws(plot_id),  # Get rid of leading and trailing spaces
                                           plot_id = gsub(pattern = ".", replacement = "", x = plot_id, fixed = TRUE)) %>%  # Replace the stray "." with nothing
  dplyr::filter(plot_id != "") %>%
  dplyr::mutate(plot_id = as.integer(plot_id))
clean_data
## # A tibble: 4 x 2
##   plot_id year_monitored
##     <int>          <dbl>
## 1       1           2019
## 2       2           2019
## 3       4           2020
## 4       5           2020

trimws, gsub, and != are all vectorized operations being applied to the column plot_id. Let’s see what’s happening to just the plot_id column step by step under the hood.

plot_id_vec <- messy_data$plot_id
plot_id_vec
## [1] "1 "  " 2 " " "   " 4"  "5."
plot_id_vec <- trimws(plot_id_vec)
plot_id_vec
## [1] "1"  "2"  ""   "4"  "5."
plot_id_vec <- gsub(pattern = ".", replacement = "", x = plot_id_vec, fixed = TRUE)
plot_id_vec
## [1] "1" "2" ""  "4" "5"
id_is_valid <- plot_id_vec != ""
id_is_valid
## [1]  TRUE  TRUE FALSE  TRUE  TRUE
plot_id_vec <- plot_id_vec[id_is_valid]
plot_id_vec
## [1] "1" "2" "4" "5"
plot_id_vec <- as.integer(plot_id_vec)

Vectorization in R is a powerful tool. However, not everything is vectorized. One common example is read.csv(). read.csv() takes the path to a .csv file and returns a data frame of the data in the csv. But what if the data is stored across multiple .csv files? read.csv(c(file_path_1, file_path_2, file_path_3)) doesn’t work because read.csv isn’t vectorized. Fortunately, vectorization isn’t the only way to avoid repeating yourself in R.
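One workaround is to call read.csv() once per path, for example with lapply(). A minimal sketch, using tiny throwaway files created in a temp folder purely for illustration:

```r
# Create two small csv files in a temp folder just for this example
paths <- file.path(tempdir(), c("part1.csv", "part2.csv"))
write.csv(data.frame(x = 1:2), paths[1], row.names = FALSE)
write.csv(data.frame(x = 3:4), paths[2], row.names = FALSE)

# read.csv() isn't vectorized, but lapply() calls it once per path
dfs <- lapply(paths, read.csv)   # a list of data frames, one per file
combined <- do.call(rbind, dfs)  # stack them into a single data frame
nrow(combined)
## [1] 4
```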

Control Structures

Control Structures

What are control structures? The simplest way to think about this is that these are programming elements that control the flow of data in your code depending on the value of logical conditions. There are six basic control structures, and you can find them with help("Control").

There are two control structures we are going to focus on: for loops and if/else statements. for loops are a control structure that allow you to repeat a section of code multiple times (without copy-pasting it over and over!). if/else statements control whether or not a section of code is run based on certain conditions. If we have time we are going to touch on while at the end, mostly for fun.

if/else

if/else Statements

Most of you are probably familiar with this. It has 3 pieces (in order):

  1. A conditional statement ("if") to evaluate as TRUE/FALSE, e.g. (x > 1)
  2. An action to perform if it is TRUE
  3. An action to perform if it is FALSE

On any individual data point or set of data, only two of these code pieces will be applied. The conditional statement (1) will always be evaluated. Then only the appropriate T or F action (2 or 3) will be applied - the other will be skipped. There are two different ways to invoke this in R, and they do different things:

This is the ‘vectorized’ version - you might use this to recode data.

ifelse(test= , yes= , no= ) (hereafter ifelse)

This is not vectorized but is very useful where you need a ‘switch’. It can evaluate complex data like a dataframe, but only as long as the condition evaluates to a single boolean (TRUE or FALSE).

if(...){}else{} (hereafter if/else)

Let’s see how ifelse works to recode some data. Let me build this up starting from the data:

d <- c(1:4, 100*1:4)
d
## [1]   1   2   3   4 100 200 300 400

A simple vector of numbers.

Now let’s evaluate that vector in a logical expression:

q_logic<- d < (mean(d) / 10)
q_logic
## [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE

This evaluates the expression and returns TRUE for all values of d that are less than the mean of d divided by 10. Do you remember PEMDAS? Well, in R it is PEMDASL: logical comparisons are evaluated after the math expression.

In reality this vector of logicals is what ifelse is actually evaluating. In fact, we can just drop that into the ifelse.

ifelse(q_logic, yes="verify", no="OK") #yes is the true condition no is the false condition
## [1] "verify" "verify" "verify" "verify" "OK"     "OK"     "OK"     "OK"

It goes through and replaces every TRUE value in that vector of logicals with “verify” and all of the FALSE values with “OK”.

In reality we seldom develop this kind of statement step by step like that. We would normally just write it on a single line.

q <- ifelse(d < (mean(d) / 10), "verify", "OK")

Now, let’s see what happens if we have some missing data.

d <- c(1:4, 100*1:4, NA)
ifelse(d < mean(d, na.rm = T) / 10, "verify", "OK")
## [1] "verify" "verify" "verify" "verify" "OK"     "OK"     "OK"     "OK"    
## [9] NA

That NA gets passed through. Maybe we want that, maybe we don’t. In mean() and many other functions we were able to set na.rm = TRUE to omit the NA value, but reading through help(ifelse), we don’t really have that option. One option is to nest two ifelse conditions.

ifelse(is.na(d), "Missing",
  ifelse(d < mean(d, na.rm = T) / 10, "verify", "OK")
)
## [1] "verify"  "verify"  "verify"  "verify"  "OK"      "OK"      "OK"     
## [8] "OK"      "Missing"

is.na() returns a logical vector with F for all finite elements and T for all NA and NaN (but not Inf). In this nested structure, the F cases are further evaluated and labeled by the second statement. Nesting is useful, but chances are if you find yourself nesting more than 2 or 3 conditions deep, there is probably a better way to do things.
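For reference, one alternative to deep nesting is dplyr::case_when(), which also shows up later in the QA/QC loop. This sketch assumes the dplyr package is installed; conditions are checked in order and the first match wins:

```r
d <- c(1:4, 100 * 1:4, NA)
flags <- dplyr::case_when(
  is.na(d) ~ "Missing",                    # checked first, so NA never reaches the next line
  d < mean(d, na.rm = TRUE) / 10 ~ "verify",
  TRUE ~ "OK"                              # catch-all for everything else
)
flags
## [1] "verify"  "verify"  "verify"  "verify"  "OK"      "OK"      "OK"     
## [8] "OK"      "Missing"
```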

The next option is if/else. The first thing to know about the if/else construction in R is that it is very sensitive to bracket placement: else must always sit between a closing and an opening bracket, like this: }else{. If it isn’t, you will end up with errors that are hard to track down. In context:

R will not comprehend this:

if(...){thing to do if TRUE}
      else{
      thing to do if FALSE
      }

R will only comprehend this:

if(...){thing to do if TRUE
           }else{
           thing to do if FALSE
           }

This construction expects the if() statement to evaluate to a single boolean. Sometimes I like to use this to create switches or bypasses when I am writing loops/iteration. For example: if I have a dataframe that is non-empty, then do the analysis. If the dataframe is empty, do something else.

d <- c(1:4, 1:4 * 100, NA)

if (all(is.finite(d))) {
  "no NA, NaN, or Inf present"
} else {
  "NA present"
}
## [1] "NA present"

This also nests nicely and we will get back to that later, but here is what that looks like:

d <- c(1:4, 1:4 * 100)

if (!all(is.finite(d))) {
  "NA present"
} else if (all(d < 100)) {
  "all items less than 100"
} else {
  paste(sum(d > 100), "items more than 100")
}
## [1] "3 items more than 100"

I most commonly use the if/else constructions to create a switch in the code for special handling of undesirable data. However, I usually only go to this if there isn’t a more direct solution. In my mind, ifelse gives a false sense of security. What all is in that ‘else’?

Iteration

Iteration

This brings us to iteration. Iteration is a control structure that causes code to repeat itself while the conditions for iterating are met. We will explain what that means a little later. Practically speaking, it allows you to get a lot of work done without having to be the one doing it. Iteration consists of a few pieces:
  1. The data you need to do something to
  2. The method of iteration
  3. The task you need to do
  4. The result of your task

What are some kinds of repetitive tasks you find yourself doing that you are wanting to do more easily in R?

All of those can be accomplished with iteration. There are a lot of ways to accomplish iteration and we are not going to cover them all. We are going to cover some of the most basic and commonly used approaches.

The first one you may be familiar with is the for loop. You give this a list of items and tell it what you want it to do and it will do that thing for each item on the list. In simplest terms that looks like:

for(i in vector){ 
  do thing 1
  do thing 2
  ...
  return output 
}

A few things to pick apart here:

i - i is an arbitrary variable name; you could just as easily call it q. On the first step of the iteration, it takes on the value of the first item in the vector. i is then available inside the loop as a variable, and it takes on each successive value in the vector with each step of the iteration.

vector - vector is that vector of items, aka the conditions for iteration. These could be numeric, or they could be characters or strings. Numeric vectors are common if you need to run through all of the cases. Character vectors are great if you have groups of data you want to work through. All vector does is give the loop instructions on what to work on and, effectively, when to stop.

Let’s make sure we understand those two things.

vec<-1:9
for (i in vec) { # for the number 1 through 9
         i
}   

Nothing happened because I did not actually tell it to return something from that loop. That isn’t entirely true. Something happened:

i 
## [1] 9

i was created in the environment, and it has a value of 9, the last value of the loop. Logically, that means the loop made it to the end of vec.

for (i in vec) { # for the number 1 through 9
         print(i+1)
}   
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10

Now we can see it clearly walking through that loop and adding 1 to each value. But what we really want is a vector of results out of it.

#alternatively predefine a container to receive the data
OutDat <- NULL # define an empty variable. This is naughty, see discussion on 'growing' dataframes later on
for (i in 1:9) {
  OutDat[i] <- i + 1 # store that in the ith position of the variable OutDat
}
OutDat
## [1]  2  3  4  5  6  7  8  9 10
Before we move on to more interesting code, I want to pause and note that you will hear us tell you that there are better ways to iterate than a for loop. However, for loops are a fundamental skill for several reasons.
  • They reinforce a fundamental understanding of iterative processing.
  • Everyone (probably) falls back on a for loop eventually.
  • They are the precursor to writing your own functions.
So let’s make a useful for loop. We are going to use raw hobo air temperature files, and we will want to do four things with those data. These should sound familiar:
    1. Look at the data
    2. We want to calculate some summary statistics.
    3. We want to do a rudimentary QAQC check and flag potentially bad data
    4. We want to save it to a file so we can do some further visual assessment.
library(magrittr)
library(ggplot2)

#create a list of file names that we will read in
fNames <- paste0(
  "https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Intro/master/Data/",
  c(
    "APIS01_20548905_2021_temp.csv",
    "APIS02_20549198_2021_temp.csv",
    "APIS03_20557246_2021_temp.csv",
    "APIS04_20597702_2021_temp.csv",
    "APIS05_20597703_2021_temp.csv"
  )
)
fNames
## [1] "https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Intro/master/Data/APIS01_20548905_2021_temp.csv"
## [2] "https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Intro/master/Data/APIS02_20549198_2021_temp.csv"
## [3] "https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Intro/master/Data/APIS03_20557246_2021_temp.csv"
## [4] "https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Intro/master/Data/APIS04_20597702_2021_temp.csv"
## [5] "https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Intro/master/Data/APIS05_20597703_2021_temp.csv"

Note: paste0() is a handy vectorized variant of paste() that assumes no separator between arguments.
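For comparison, paste() inserts a space by default while paste0() does not, and both vectorize over their arguments:

```r
paste("APIS01", "2021", "temp.csv")       # default sep is a space
## [1] "APIS01 2021 temp.csv"
paste0("APIS", c("01", "02"), "_2021_temp.csv")  # vectorized over the site numbers
## [1] "APIS01_2021_temp.csv" "APIS02_2021_temp.csv"
```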

Let’s get some idea of what we are dealing with in the data so we can plan the code. It is almost impossible to read in data from files blindly.

#read in and inspect the first element (for training purposes)
temp<-read.csv(fNames[1], skip = 1, header = T) #hobo data has an extra line at the top we need to skip
head(temp)
##   X. Date.Time..GMT.05.00
## 1  1    03/14/21 17:00:00
## 2  2    03/14/21 18:00:00
## 3  3    03/14/21 19:00:00
## 4  4    03/14/21 20:00:00
## 5  5    03/14/21 21:00:00
## 6  6    03/14/21 22:00:00
##   Temp..Â.F..LGR.S.N..20548905..SEN.S.N..20548905..LBL..APIS01.
## 1                                                        78.274
## 2                                                        36.531
## 3                                                        28.762
## 4                                                        26.668
## 5                                                        26.031
## 6                                                        25.390
##   Bad.Battery..LGR.S.N..20548905. Coupler.Attached..LGR.S.N..20548905.
## 1                                                                     
## 2                                                                     
## 3                                                                     
## 4                                                                     
## 5                                                                     
## 6                                                                     
##   Host.Connected..LGR.S.N..20548905. Stopped..LGR.S.N..20548905.
## 1                                                               
## 2                                                               
## 3                                                               
## 4                                                               
## 5                                                               
## 6                                                               
##   End.Of.File..LGR.S.N..20548905.
## 1                                
## 2                                
## 3                                
## 4                                
## 5                                
## 6

Gross, but we can fix that; it is just the headers. We could actually use some of that info, but we will save those games for a different day. Let’s redo this with header set to FALSE, and you can just trust me that I know what is in each column. (Note: I am changing to skip = 2, otherwise this would treat the headers as data.)

temp<-read.csv(fNames[1], skip = 2, header = F)
names(temp)[1:3] <- c("idx", "DateTime", "T_F") #let me just assign some column names really quick
head(temp)
##   idx          DateTime    T_F V4 V5 V6 V7 V8
## 1   1 03/14/21 17:00:00 78.274               
## 2   2 03/14/21 18:00:00 36.531               
## 3   3 03/14/21 19:00:00 28.762               
## 4   4 03/14/21 20:00:00 26.668               
## 5   5 03/14/21 21:00:00 26.031               
## 6   6 03/14/21 22:00:00 25.390
str(temp) #look at data types
## 'data.frame':    4554 obs. of  8 variables:
##  $ idx     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ DateTime: chr  "03/14/21 17:00:00" "03/14/21 18:00:00" "03/14/21 19:00:00" "03/14/21 20:00:00" ...
##  $ T_F     : num  78.3 36.5 28.8 26.7 26 ...
##  $ V4      : chr  "" "" "" "" ...
##  $ V5      : chr  "" "" "" "" ...
##  $ V6      : chr  "" "" "" "" ...
##  $ V7      : chr  "" "" "" "" ...
##  $ V8      : chr  "" "" "" "" ...
rm(temp) #do a little cleanup 
There are ~5 columns in most of these files.
    Col. 1 is a sequential index column
    Col. 2 is a date and time string
    Col. 3 is the air temperature in degrees F
    Col. 4 is light measured as lux or lumens per ft2 (if it was measured)
    Col. >5 are comment columns

We are only interested in the first 3 columns today.

So, things we are aware of for coding so far:
    -Three columns of data is all we want right now
    -Two columns are numeric/integer
    -One is time stored as a character but we want it to be real time
    -Really ugly headers that need cleanup, so bad I am not even going to read them in.
    -Lastly, because I know these data, one of the files is empty except for the headers :O.

Done with preliminary data inspection.

Summary Statistics Loop

What do we want to do? We are going to create a simple summary statistics dataframe. The loop will need to handle the empty file somehow. You will need to initialize an empty variable so you can get data out of the for loop. We will use a list for reasons discussed below.

len<-length(fNames) #calculate the number of iterations

SummaryData <- list() #Generate a list that we will put output into. Always rerun prior to loop.

for (i in 1:len) {
  # 1 Read in and generate the data to use
  # 1a extract the filename from a filepath and parse the metadata
  fName <- basename(fNames[i])
  Site <- stringr::str_sub(fName, 1, 6)
  Year <- as.numeric(stringr::str_sub(fName, -13, -10))
  # 1b read in the data, change headers, set time so we can do some math later
  d <- read.csv(fNames[i], skip = 1, header = T)[1:3] %>%
    setNames(., c("idx", "DateTime", "T_F")) %>%
    dplyr::mutate(., DateTime2 = as.POSIXct(DateTime, "%m/%d/%y %H:%M:%S", tz = "UCT"))
  # 2 summarize and 3 output the data
  SummaryData[[i]] <- data.frame( #output the data
    # 2 what to summarize for all files 
    Site = Site,
    Year = Year,
    FileName = fName,
    # Variable output
    # 2a What to summarize so long as there is 1 non-NA record
    if (any(is.numeric(d[, 3]))) { # check that the temperature column actually read in as numeric data
      list(
        T_F_mean = mean(d$T_F, na.rm = T),
        T_F_min = min(d$T_F, na.rm = T),
        T_F_max = max(d$T_F, na.rm = T),
        nObs = nrow(d),
        nDays = (d$DateTime2[nrow(d)] - d$DateTime2[1])[[1]], # why we created a posix
        Flag = NA
      ) 
      # 2b what to summarize if the data are missing.
    } else {
      list(
        T_F_mean = NA,
        T_F_min = NA,
        T_F_max = NA,
        nObs = nrow(d),
        nDays = NA,
        Flag = "Hobo File Present But No Data"
      )
    }
  )
}
dplyr::bind_rows(SummaryData)
##     Site Year                      FileName T_F_mean T_F_min T_F_max nObs
## 1 APIS01 2021 APIS01_20548905_2021_temp.csv 57.55168  14.608 100.283 4554
## 2 APIS02 2021 APIS02_20549198_2021_temp.csv 47.90618 -25.123  92.737 6287
## 3 APIS03 2021 APIS03_20557246_2021_temp.csv       NA      NA      NA    0
## 4 APIS04 2021 APIS04_20597702_2021_temp.csv 49.05190 -15.628  99.885 6286
## 5 APIS05 2021 APIS05_20597703_2021_temp.csv 47.68413 -22.549  91.612 6287
##      nDays                          Flag
## 1 189.6020                          <NA>
## 2 261.8086                          <NA>
## 3       NA Hobo File Present But No Data
## 4 261.8091                          <NA>
## 5 261.8034                          <NA>

Let’s take some time to pick this apart and understand what it is doing.

1. The read and generate data step is a pretty standard part of a loop, where you define some variables that you are going to reuse later. Here, we are just parsing some metadata from the filename. Reading in your data is also a standard part of a loop; in a for construction, i usually does not contain your data itself.

2. The getting stuff done step. This step is setting up where the data goes – “list element i” – and what data to put there. The nifty thing is that we are calling an if/else statement within the construction of the dataframe. There are dataframe elements that will be fixed and dataframe elements that will be variable.

3. The output step. Here, the output step is integrated into the summarize step. This is where we put the data into something we can access after the for loop is done.

A note on the output technique. We output data into a list then simplify the list at the end. We could have built a dataframe directly, however, this would have required an approach called ‘growing’ a dataframe. This approach is simple, but is considered highly inefficient and is generally discouraged. It is ok with small dataframes, but with larger ones, it will bog down your system.
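Here is that pattern in miniature (the site names and values are made up, just to show the shape of it):

```r
results <- list()  # empty container, created before the loop
for (i in 1:3) {
  # each iteration fills one list element with a one-row data frame
  results[[i]] <- data.frame(Site = paste0("Site", i), MeanT = i * 10)
}
out <- do.call(rbind, results)  # base-R equivalent of dplyr::bind_rows(results)
out$MeanT
## [1] 10 20 30
```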

Let’s carry on and loop up some other tasks. Next up we want to do some QA/QC flagging on the original data.

QA/QC Loop

OutPath <- paste0(getwd(), "/hobo_ouputs/") # or put your preferred path here (we will use this later)
# set up some QA thresholds
TempUpper <- 40
TempLower <- (-40)

for (i in fNames) {
  #1 Read in/generate data
  fName <- basename(i)
  d <- read.csv(i, skip = 1, header = T)[1:3] %>%
    setNames(., c("idx", "DateTime", "T_F"))
  
  #2 Generate your output in this case, some QA/QC flagging
  d_com <- d %>%
    dplyr::mutate(
      # checks for too big of a temp change
      TChangeFlag = ifelse(c(abs(diff(T_F, lag = 1)) > 10, FALSE), "D10", NA),
      # flag if a temp is too high or too low
      Flag = dplyr::case_when(
        is.na(T_F) ~ "MISSING",
        T_F > TempUpper ~ "High",
        T_F < TempLower ~ "Low"
      )
    ) %>%
    # merge so you can have multiple flags applied
    tidyr::unite("Flag", c(TChangeFlag, Flag), sep = "_", remove = TRUE, na.rm = TRUE)

  #3 Output the data
  dir.create(OutPath, showWarnings = F)
  write.csv(d_com[, c("DateTime", "T_F", "Flag")],
    paste0(OutPath, gsub(".csv", "_QC.csv", fName)),
    row.names = F, quote = F
  )
}

We used some new functions there, but let’s start at a higher level and work our way down to the lower level functions.

What changed with the for loop? A for loop can accept any vector as input. It could be a vector of names or it could be a vector of numbers. In the first loop we did, I wanted a vector of numbers so I had an automatic index to use. In this case, I just need the name, so I don’t need that index.

  1. This is mostly the same as before.
  2. There is new stuff here, but it boils down to: “I want to create 2 variables and then merge them.” mutate() is creating the variables Flag and TChangeFlag. Flag uses a dplyr version of ifelse() called case_when() that allows you to specify what to return when certain conditions are true. TChangeFlag is based on the first derivative of the temperature data - the rate of change. If the rate of change is too high, then we flag it. This could have been done with case_when(), but we are talking about control structures today.
  3. I want to point out the ‘on the fly’ file name generation. This takes some practice in paste(). All this is doing is slightly modifying the old name and giving it a new directory.
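Here is that file name generation in isolation, a small sketch with a made-up input name:

```r
fName   <- "APIS01_20548905_2021_temp.csv"  # example input file name
OutPath <- "hobo_ouputs/"                   # output folder, as defined above

# Swap the extension for a suffixed one, then bolt the directory on the front
newName <- paste0(OutPath, gsub(".csv", "_QC.csv", fName, fixed = TRUE))
newName  # "hobo_ouputs/APIS01_20548905_2021_temp_QC.csv"
```

One small caution: in gsub(), the pattern ".csv" is a regular expression where "." matches any character; adding fixed = TRUE (as here) treats it as a literal string, which is slightly safer than the bare pattern used in the loop.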

Plot Loop

Let’s do our last for loop, one that is pretty common: the ‘barf out a bunch of plots’ loop. I am visual and can quickly diagnose data by looking at it. I like to do this by having PDFs on hand.

OutPath <- paste0(getwd(), "/hobo_ouputs/") # or put your preferred path here (we will use this later)

for (i in fNames) {
  #1 Read in/generate data
  fName <- basename(i)
  d <- read.csv(i, skip = 1, header = T)[1:3] %>%
    setNames(., c("idx", "DateTime", "T_F")) %>%
    dplyr::mutate(., DateTime2 = as.POSIXct(DateTime, "%m/%d/%y %H:%M:%S", tz = "UCT"))
  #2 Do the thing you want to do, in this case make and save plots
  p <- ggplot(d, aes(x = DateTime2, y = T_F)) +
    geom_point(size = 0.5) +
    ggtitle(fName)
  #3 Output the results outside of the for loop
  dir.create(OutPath, showWarnings = F)
  ggsave(
    filename = gsub(".csv", "_plot.pdf", fName), 
    path = OutPath, device = "pdf",
    plot = p, width = 5, height = 5, units = "in"
  )
}

Let’s break this down again because practice makes perfect.

  0. The iteration hasn’t changed. Same approach, different products.
  1. Same.
  2. This should be pretty familiar from the data visualization day.
  3. I am a huge fan of doing things in base R, but boy does ggsave take the frustration out of saving plots. What I want to point out here is that we don’t need to build the full file path. We specify the file name as discussed before and feed it a path separately.

Recap and final thoughts on for loops and control structures

What did we do?

  • Learned the basic structure of most for loops. Probably the most important thing.
  • Used some for loops to do actual tasks on multiple datasets.
  • Read files into and wrote files out of a for loop.
  • Wrote data calculated in a for loop to a variable in the environment.
  • Used ifelse() logic to flag some data for checking.
  • Used if...else to create a switch for NA handling in a dataframe in a for loop.
  • Maybe learned a few new functions.

At this stage you may be thinking: Ok, but you have read the data in multiple times and repeated a lot of code. Couldn’t we just wrap this in one giant for loop?

  • We could have written the loop to read in the data, store it in memory, and then loop only on the stored data for flagging, statistics, and plotting. But, loading stuff in R takes up memory. Depending on the size of your datasets, you may not want unused information floating around hogging memory. Continuous data files can become large objects. So, it depends.

  • There are only rare instances where you want to wrap stuff up in that complicated of a for loop. The more complex the loop the harder it is to debug. Complex for loops are also hard to build flexibility into. In general you should shoot for one task or task family per loop.

Other (better) ways to iterate in R

Now that you have the basics down, we are going to forge ahead with better and better ways to iterate. One family of functions that often replaces for() is the apply family of functions.

apply - operates on things resembling a dataframe. Okay, really it operates on a matrix, where all values are the same type (all numbers or all characters). It can operate by row or by column, and the output will be a vector or matrix rather than a dataframe.

mapply - Is a way to vectorize a function that isn’t vectorized. I don’t think I have used this in the past year, but it is in the apply family and you might encounter it.

lapply - Apparently, this is a special case of mapply and it operates on lists of things and stores the output in a list of things. sapply and vapply are variations on this theme.

sapply - A ‘wrapper’ for lapply that will coerce the list output into a matrix or array.

tapply - apply by groups. This lets you apply functions by a group. So, quick summary statistics, but you will quickly outgrow this and there are better ways to do this.
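A quick hedged sketch of tapply (and sapply) using mtcars, before we move on:

```r
# tapply: mean mpg within each cylinder group
by_cyl <- tapply(mtcars$mpg, mtcars$cyl, mean)
by_cyl          # a named vector with one mean per group ("4", "6", "8")

# sapply: apply a function over list-like input, simplified to a vector
sapply(mtcars[, c("mpg", "hp")], mean)
```

As the text says, you will quickly outgrow tapply; dplyr’s group_by() + summarize() covers the same ground more flexibly.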

We are going to focus on apply and lapply. apply - Going back to that definition, when might you use it?

To explore low level functions in apply, let’s look at mtcars, a wonderful little built-in dataset that I like to use for developing and demoing code (see help(mtcars) for a description).

str(mtcars)
head(mtcars)

First let’s calculate the mean across the rows. It doesn’t make a whole lot of sense to do this because the columns contain values with different units, but we can if we set MARGIN=1.

apply(mtcars,MARGIN=1, FUN=mean) #calculate mean across the row

Note: apply will operate on all the data it is provided. You will have to reduce your data to just that which you want analyzed.

The make/model of the vehicle was stored as rownames so is not technically data being passed to mean for analysis. We get out a vector with the (meaningless) average value for each make and model of vehicle.

It might make more sense to calculate the mean value for each of the attributes of a car. We can do that by setting margin=2.

apply(mtcars,MARGIN=2, FUN=mean)

That returns the mean of each variable. What is happening is that apply is disassembling the data frame and passing it row by row (MARGIN=1) or column by column (MARGIN=2) into mean(), then collecting the output into a vector.

Let’s change gears a little bit and return to those hobo data to take a look at lapply. First, a quick review of what a list is. A list is an R object that can hold a bunch of different elements. A list is ok if one element is a number and the next element is a character string, or if one element has 100 columns and 10 rows while the next element has 10 columns and 100 rows. It can even hold a dataframe as one element and a matrix as another without any trouble. A list says you’re ok just the way you are.
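For example, here is a small sketch of a list holding wildly different things (all toy values):

```r
mixed <- list(
  site   = "APIS01",          # a single string
  counts = c(3, 1, 4, 1, 5),  # a numeric vector
  visits = head(mtcars, 2)    # even a dataframe, no trouble at all
)

str(mixed)     # each element keeps its own structure
mixed$counts   # elements can be pulled out by name...
mixed[[3]]     # ...or by position with double brackets
```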

We are going to use our list to read in data that are mostly the same. If we wanted to store them in a single dataframe: 1) it would get unwieldy and 2) you would have to write code that ‘fixed’ inconsistencies between the dataframes. By storing them as independent list elements we don’t necessarily need to write code that ‘fixes’ the inconsistencies.

We discussed this earlier: in the previous for loop we operated on each of our hobo csvs one at a time. What we are going to do this time is open them all into a single list object with 5 different elements (1 for each csv file).

You should still have fNames hanging out somewhere. This is that vector of hobo file names.

fNames

What we will do next is iterate through that vector of names and store the results in a list. We do that with lapply. What does lapply do? It takes in a list or vector and returns a list of the same length. lapply(X, FUN, ...) has 3 arguments:

  • X - the data, or the vector of things to pass to FUN
  • FUN - what to do
  • ... - the ellipsis. This is a bit of an odd one: it is where you put whatever other arguments you want passed to FUN. Looking at the examples below will help.

Get the data

HoboList<-lapply(fNames,read.csv, skip=1, header=T)
HoboList

This is passing, sequentially, file paths into read.csv(). Just like our for loop more or less. It is also passing skip and header values to read.csv(). It returned a list of all the csvs with everything in them.

Simplify the Data

HoboList<-lapply(HoboList,"[",,1:3)
HoboList  

This then takes that list as input and passes each of the dataframes contained within it to the indexing function. That is right, indexing [] is a function, and we can pass it to FUN by calling "[". What fun! Note the empty comma there: "[",,1:3 is intentional and will be interpreted the same as saying x[,1:3], or… go get columns 1 through 3.
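If the quoted "[" feels strange, this tiny sketch shows the two spellings are the same call:

```r
# Ordinary indexing syntax...
a <- mtcars[, 1:3]
# ...is just sugar for calling the `[` function directly
b <- "["(mtcars, , 1:3)

identical(a, b)  # TRUE - same function, same arguments, same result
```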

Clean those headers

HoboList<-lapply(HoboList,setNames, c("idx","DateTime","T_F"))
HoboList

Then we simply rename the headers, because lapply passes each dataframe in the list to setNames().

All together now with the piping!

HoboList<-lapply(fNames,read.csv, skip=1, header=T)%>% #1. read hobo data into a list
          lapply(.,"[",,1:3)%>% #2. Grab only first 3 columns
          lapply(.,setNames, c("idx","DateTime","T_F")) #3. fix column names

Now all of those hobo files are available in the HoboList object that we could summarize, do some flagging, or plot from. How do we do that? We could do for loops again, or we could keep going with lapply.

For now, let’s create a little bit of summary data.

HoboSummary<-lapply(HoboList[-3], FUN=function(x){ #-3 drop that blank file for now
  d<-data.frame(
    Mean = mean(x$T_F, na.rm=T),
    Min = min(x$T_F, na.rm=T),
    Max = max(x$T_F, na.rm=T),
    nObs = nrow(x)
  )
})
do.call(rbind, HoboSummary) #equivalent to dplyr::bind_rows(HoboSummary)

This is a stripped down version of the for loop we did before. We then use do.call(), the base equivalent of dplyr::bind_rows(), to generate a dataframe. do.call() takes a function (here rbind) and passes it a list of arguments (here, the list elements). So, it passes all of the individual elements of HoboSummary to rbind as individual arguments to glue together.
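A stripped-down sketch of what do.call() is doing, with toy vectors:

```r
args_list <- list(c(1, 2), c(3, 4), c(5, 6))

# do.call(f, list(a, b, c)) is exactly the same call as f(a, b, c)
stacked <- do.call(rbind, args_list)
stacked  # a 3 x 2 matrix, same as rbind(c(1, 2), c(3, 4), c(5, 6))

identical(stacked, rbind(c(1, 2), c(3, 4), c(5, 6)))  # TRUE
```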

We could then do the flagging step and the plotting steps. However, we are going to stop here and pick those up in the advanced R session because we really need to talk about how to make a function() before we go a whole lot further and that is a topic for next time.

A note on while loops

Another type of loop you should be aware of is the while loop. It probably won’t come up much when you’re working with data in R, but as a programmer you should have it in your toolbox. While the apply family of functions is specific to R, for loops and while loops are present in most programming languages.

In a for loop, you decide ahead of time how many times the code inside your loop should execute. The for loop iterates through a vector of a certain length and stops when it reaches the end.

In a while loop, on the other hand, your loop executes while some condition is true. When that condition is no longer true, the loop stops. while loops are good for situations when you can’t predict ahead of time how many times the loop needs to execute.

As an example, we’ll make a simple guessing game. The code below selects a random integer between 1 and 10. It prompts the user to type a guess at the console. If they guess correctly within three tries, they win.

guess <- 0  # Initialize the guess variable
number <- sample(1:10, 1)  # Get our random number
guess_count <- 0  # Number of guesses so far

while (guess != number && guess_count < 3) {  # Keep looping if the guess is wrong and we haven't hit our 3 guesses yet
  guess_count <- guess_count + 1
  message <- paste0("Guess ", guess_count, ": what is the random number?  ")
  guess <- as.integer(readline(prompt = message))
  
  if (guess == number) {
    print("Woohoo! You won the game!")
  } else if (guess_count < 3) {
    print("Nope, try again.")
  } else {
    print("You lost :(")
  }
}

When you are writing a while loop, make sure that it will eventually terminate! If the test expression (in this case guess != number && guess_count < 3) never becomes false, your while loop will go on forever! An infinite while loop will ungracefully crash your code, so be sure to test your loop thoroughly, especially if it’s more complex.
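One common defensive pattern, sketched here with toy numbers, is to build a hard iteration cap into the condition so the loop cannot run away even if your main test misbehaves:

```r
x         <- 1000
steps     <- 0
max_steps <- 100   # safety valve: the loop stops here no matter what

while (x > 1 && steps < max_steps) {
  x <- x / 2               # halving always shrinks x, so this does terminate
  steps <- steps + 1
}

steps  # 10, because 1000 / 2^10 is the first value at or below 1
```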

Troubleshooting Iteration


Overview

Believe it or not, there are only two types of errors and both are human. 1) Your code is written incorrectly (let’s go find that missing ‘)’…). 2) You are telling your code to do something with data that it can’t do with that data (let’s figure out where we need to set na.rm=T). Another branch of troubleshooting is optimization - “why is my code so slow?” We will cover that next week.

Troubleshooting iteration is often challenging and frustrating. Sometimes errors are obvious; sometimes errors only show up when you start digging into outputs. Sometimes it works fine on the test data and fails on the real data. Ultimately, debugging is an exercise in hypothesis testing: “If I do this, I should get this result. Where did I not get the expected result, and why?” We present some basic manual approaches and resources here that should help you debug iteration. As you code, you will find the approach that works best for you in each situation.

Trouble cues

Clues you need to troubleshoot

Warnings - Warnings are your prompt to check if there is a problem. Code will seldom fail to produce output on a warning; the question is whether or not it is the output you are after. When you see warnings, you should run warnings() and see if it makes sense why they occurred. ggplot, for example, will issue warnings for each incomplete record (e.g. a value for y, but not x) and will continue to plot everything that is complete. It is always good to pay attention to the number of warnings and periodically check that they are expected. If they are unexpected, you may wish to dig deeper.

Errors - In short, errors occur when the functions used in your code encounter data that they can’t use. Sometimes an error tells you what the problem is; more often it doesn’t. Many errors are general and not specific to a function, so you will want to Google the function and the error to see what people are saying. It may take a few iterations to get the right search terms.

Unexpected behavior - If you expect that 1+1=2 but your code returns 3, that is a good indicator you need to troubleshoot. It is always good to feed your code some data for which you already know the answer or output. Both positive and negative examples should be used. A positive example is one where you know it should work normally and a negative example is one where you know it should break. I often do this during code development as I am building different modules or chunks. Fundamentally I have two checks: 1) Is it doing what I expect it to do? 2) What happens if I slip an NA/NaN/Inf or an empty dataframe in there?
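As a sketch of those two checks, here is a hypothetical summary helper fed a positive example, an NA, and an all-NA edge case:

```r
summarize_temp <- function(v) mean(v, na.rm = TRUE)  # hypothetical helper

# Positive example: known input, known answer
stopifnot(summarize_temp(c(1, 2, 3)) == 2)

# Slip an NA in: na.rm = TRUE quietly drops it...
summarize_temp(c(1, NA, 3))   # 2
# ...whereas without na.rm, the NA propagates
mean(c(1, NA, 3))             # NA

# Edge case: all-NA input returns NaN - worth flagging downstream
summarize_temp(c(NA_real_, NA_real_))  # NaN
```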

Finding Errors

Tracking down the error

Tracking the error down is usually the first step in fixing it. Once you track it down, you can see which functions or which data are involved. There are generally two approaches to this: manual and R-assisted. I am going to focus on the manual approach, because at some point you will have to fall back on it, so it is good to have a solid understanding of it before you branch out to R-assisted approaches. For more on R-assisted approaches, see the session on debugging elsewhere in this training.

  1. Don’t try to pack too much into your code. Writing ‘everything loops’ means that you will have a lot of functions. Different functions have different inputs and outputs. The more different data structures and outputs your iteration has to handle, the more likely it is to break or provide unexpected results. Iteration works really well when it has 1 input structure and 1 output structure.

  2. Make your code verbose - Iteration usually works until it doesn’t: it will loop up to the point or dataset that breaks it. Depending on how you have written your code, you likely don’t know where it actually failed. That is a fact of life with R and iteration. Iteration tends to hide errors and warnings and makes it difficult to see where things went wrong. Adding a print() command that prints a dataset name or a piece of data lets you evaluate progress and quickly determine whether the code is functioning properly. When you are done testing and developing, you can then remove it.

As an example, let’s look at an early version of the super for loop we looked at earlier.

library(magrittr)
library(ggplot2)
fNames <- paste0(
  "https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Intro/master/Data/",
  c(
    "APIS01_20548905_2021_temp.csv",
    "APIS02_20549198_2021_temp.csv",
    "APIS03_20557246_2021_temp.csv",
    "APIS04_20597702_2021_temp.csv",
    "APIS05_20597703_2021_temp.csv"
  )
)

OutPath <- choose.dir() # note, no trailing '\\' in path
TempUpper <- 40
TempLower <- (-40)

SummaryData <- NULL # make sure this gets cleared before rerunning this code
for (i in fNames) {
  d <- read.csv(i, skip = 1, header = T)[1:3]
  fName <- basename(i)
  print(fName) # dev/debug
  names(d) <- c("idx", "DateTime", "T_F")
  d$DateTime2 <- as.POSIXct(d$DateTime, "%m/%d/%y %H:%M:%S", tz = "UCT")
  # summarize the data
  OutDat <- data.frame(
    FileName = fName,
    T_F_mean = mean(d$T_F, na.rm = T),
    T_F_min = min(d$T_F, na.rm = T),
    T_F_max = max(d$T_F, na.rm = T),
    nObs = nrow(d),
    nDays = (d$DateTime2[nrow(d)] - d$DateTime2[1])[[1]]
  ) # days
  # Append new summary data to old summary data
  SummaryData <- rbind(SummaryData, OutDat)
}

In this case, the code was “fine,” but it was not equipped to handle the data being thrown at it, and it stopped on the dataset/loop step that was the problem. This manual approach can be used to detect both code problems and data problems.

A few rules of thumb: if it returns an error on the first step of iteration, then it is a code or a data-code match problem. If it can iterate properly at least once, then the code ‘works’, but there is a code-data mismatch.

Step away from the iteration - The problem is seldom the iteration itself. In this approach you set the input data to the known issue or case you want to test and don’t attempt to iterate over all the data. Without invoking the iteration, you then run each step of your code line by line to determine where the error occurs. This is used for fixing a code error and for troubleshooting a dataset. What does that look like?

#effectively these are manual "breakpoints"

i=fNames[3] #code dev/debugging fragment
# rm(i) #debug.dev frag
SummaryData <- NULL # make sure this gets cleared before rerunning this code
# for(i in fNames){
d <- read.csv(i, skip = 1, header = T)[1:3]
d # check output
fName <- basename(i)
# print(fName)#dev/debug frag
names(d) <- c("idx", "DateTime", "T_F")
d$DateTime2 <- as.POSIXct(d$DateTime, "%m/%d/%y %H:%M:%S", tz = "UCT")
# summarize the data
OutDat <- data.frame(
  FileName = fName,
  T_F_mean = mean(d$T_F, na.rm = T),
  T_F_min = min(d$T_F, na.rm = T),
  T_F_max = max(d$T_F, na.rm = T),
  nObs = nrow(d),
  nDays = (d$DateTime2[nrow(d)] - d$DateTime2[1])[[1]]
) # days
# Append new summary data to old summary data
SummaryData <- rbind(SummaryData, OutDat)
#  }


Step away from your data - Sometimes we let our data over-complicate the coding task. There are few coding sessions where I don’t end up trying to see if I can make something work on a built in dataset (e.g. mtcars). Often when trying to do the same thing, but with different data, I end up seeing and solving the problem. This approach also forces you to reduce your code to the bare basics… which you may need when you go to ask for help.

Process-wise (you will develop your own):

  1. Develop the code that works on the test example
  2. Wrap it in your iterative loop
  3. Run the for loop on more data
  4. Identify bugs
  5. Unwrap the for loop and debug the specific cases causing problems

Lastly, don’t spin your wheels by yourself for too long. Coding is a team effort.

Ask for help

We have talked about getting help before, but here we want to cover the “Minimal Reproducible Example (MRE)” and asking for help on Stack Overflow. This is actually a crucial part of debugging. Often, the process of developing the MRE will actually answer your question.

Reproducible Examples

When you ask for help (on stackoverflow or in your office) you should provide an example of the problem you are having complete with data. This is called the minimal reproducible example (MRE). It is easiest on you and the people trying to help you if it does not use your data. In the end you want to be able to create a fairly simple statement: “I am trying to _____. Here is the code I have so far. I want it to produce output that looks like ____. However, it is not currently doing that, instead it is providing this output _____. What did I do wrong?”

Steps for getting help / developing an MRE on Stack Overflow:

  1. Identify a common built-in dataset that has a structure vaguely similar to your data, or write some simple code that mimics what you are trying to deal with. A good one to use is often mtcars, which is built into base R (ggplot2 also ships a similar dataset, mpg). It has integer data, continuous data, and categorical data. If you are using a random number generator for your data (e.g. rnorm()), then using set.seed() is advisable. Don’t try to reproduce your entire dataset; reproduce only what your code is using. If you have 40 variables and your code only operates on 8, provide a dataframe with 8 variables (or 9, in the unlikely event the problem is in the unused variable).

  2. Strip your code down to the essentials - sometimes the bigger context of your code is relevant, but usually it isn’t. A. Remove variable names that are special to your data; rename them simple things like x and y. B. Decide which steps are ‘bells and whistles’ and which are essential to the problem you are having or to what the code needs to do. Remove the bells and whistles, and check whether the code behavior changes with each removed step.

  3. Be able to clearly provide an example of what the code should produce as output. Your code does not need to be producing that output yet - that is what you are asking for help with.

  4. Type your question in plain English, avoiding jargon. Use proper punctuation, grammar, etc. Keep it short, but provide details.

  5. Assign some useful R tags, usually r and the name of the function you think is the issue.

  6. Reply to people, thank them for their suggestions, provide clarification, and select an answer. Avoid initiating a wholly new question in the comments/discussion.
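For step 1 above, here is a hedged sketch of fake-but-reproducible MRE data (all names and values invented):

```r
set.seed(42)  # anyone who runs this gets the exact same 'random' numbers

mre <- data.frame(
  site = rep(c("A", "B"), each = 5),     # a simple grouping variable
  temp = rnorm(10, mean = 15, sd = 3)    # fake measurements standing in for real data
)
head(mre)
```

Ten rows and two columns is plenty to demonstrate most problems; helpers do not need your 40-variable dataset to reproduce a bug in code that only touches two columns.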

Here is an example of a question I posted in preparation for this training. Remember that ugly for loop? It really bothered me, so I wanted to see if there was a better way to do one part of it!

https://stackoverflow.com/questions/70747292/assigning-multiple-text-flags-to-an-observation/70748841#70748841

Is that solution better than what I had? Is it better than the other answer? Maybe. In the end, I think the solution I selected will make it easier to add other flags that may come up. But I do have to call the tidyverse library; if that is the environment you work in, then that is not a concern. Maybe there is slightly less code here. It is visually more appealing. I do like the direct handling of NA values.

Useful resources

Resources

https://stackoverflow.com/help/minimal-reproducible-example https://adv-r.hadley.nz/debugging.html

Traceback - this is a somewhat newer feature of RStudio. Upon encountering an error, initiating the traceback will walk you back up the call stack to the error. I find that it only sort of gets you close and isn’t super helpful in saying ‘this is what is wrong’. I look at it to get to the general location of the error. Sometimes that is the problem spot; sometimes the error is actually upstream, and the spot it identifies just can’t handle the output from the upstream error.


Resources

There’s a lot of great online material for learning new applications of R. The ones we’ve used the most are listed below.

Online Books

  • R for Data Science First author is Hadley Wickham, the programmer behind the tidyverse. There’s a lot of good stuff in here, including a chapter on using R Markdown, which is what we used to generate this training website.
  • ggplot2: Elegant Graphics for Data Analysis A great reference on ggplot2 also by Hadley Wickham.
  • Mastering Software Development in R First author is Roger Peng, a Biostatistics professor at Johns Hopkins, who has taught a lot of undergrad/grad students how to use R. He’s also one of the hosts of the Not So Standard Deviations podcast. His intro to ggplot is great. He’s also got a lot of more advanced topics in this book, like making functions and packages.
  • R Packages Another book by Hadley Wickham that teaches you how to build, debug, and test R packages.
  • Advanced R Yet another book by Hadley Wickham that helps you understand more about how R works under the hood, how it relates to other programming languages, and how to build packages.
  • Mastering Shiny And another Hadley Wickham book on building shiny apps.

Other useful sites

  • NPS_IMD_Data_Science_and_Visualization > Community of Practice is an IMD work group that meets once a month to talk about R and Data Science. There are also notes, materials, and recordings from previous meetings, a Wiki with helpful tips, and the chat is a great place to post questions or cool tips you’ve come across.
  • STAT545 Jenny Bryan’s site that accompanies the graduate level stats class of the same name. She includes topics on best practices for coding, and not just how to make scripts work. It’s really well done.
  • RStudio home page There’s a ton of info in the Resources tab on this site, including cheatsheets for each package developed by RStudio (ie tidyverse packages), webinars, presentations from past RStudio Conferences, etc.
  • RStudio list of useful R packages by topic
  • Happy Git with R If you find yourself wanting to go down the path of hosting your code on github, this site will walk you through the process of linking github to RStudio.

Resources by Day:

Day 3: Data Visualization

Using ggplot2


Visualizing Spatial Data
  • Geocomputation in R is an excellent online book on the sf and tmap packages.
  • R Studio’s leaflet for R page shows the basics of how to use leaflet.
  • R-spatial mapping in ggplot The r-spatial.org site is a blog with lots of interesting topics covered. The blog I linked to helped me figure out how to map shapefiles in ggplot.


Code printout

This tab prints all of the code chunks in this file in one long file.

knitr::opts_chunk$set(warning=FALSE, message=FALSE)

# By using this hashtag/pound sign, you are telling R to ignore the text afterwards. This is useful for leaving annotation of notes or instructions for yourself, or someone else using your code

# try this line to generate some basic text and become familiar with where results will appear:
print("Hello, lets do some basic math. Results of operations will appear here")

# one plus one
1+1

# two times three, divided by four
(2*3)/4

# basic mathematical and trigonometric functions are fairly similar to what they would be in excel

# calculate the square root of 9
sqrt(9)

# calculate the cube root of 8 (remember that x^(1/n) gives you the nth root of x)
8^(1/3)

# get the cosine of 180 degrees - note that trig functions in R expect angles in radians
# also note that pi is a built-in constant in R
cos(pi)

# calculate 5 to the tenth power
5^10
# the value of 12.098 is assigned to variable 'a'
a <- 12.098

# and the value 65.3475 is assigned to variable 'b'
b <- 65.3475

# we can now perform whatever mathematical operations we want using these two variables without having to repeatedly type out the actual numbers:

a*b

(a^b)/((b+a))

sqrt((a^7)/(b*2))
# read in the data from tree_growth.csv and assign it as a dataframe to the variable "tree_data"
tree_data <- read.csv("data/tree_growth.csv")

# display the 'tree_data' data frame we just created
tree_data
head(tree_data)  # Show the first few lines of the dataframe. Defaults to showing the first 6 rows
head(tree_data, n = 10)  # Show the first 10 rows

summary(tree_data)  # View a basic statistical summary

str(tree_data)  # Get info about the structure of the data
View(tree_data)  # Open a new tab with a spreadsheet view of the data
digits <- 0:9  # Use x:y to create a sequence of integers starting at x and ending at y
digits
is_odd <- rep(c(FALSE, TRUE), 5)  # Use rep(x, n) to create a vector by repeating x n times 
is_odd
shoe_sizes <- c(7, 7.5, 8, 8.5, 9, 9.5, 10, 10.5, 11, 11.5)
shoe_sizes
favorite_birds <- c("greater roadrunner", "Costa's hummingbird", "kakapo")
favorite_birds
is.vector(digits)  # Should be TRUE
is.vector(favorite_birds)  # Should also be TRUE

length(digits)  # Hopefully this is 10
length(shoe_sizes)

# Even single values in R are stored as vectors
length_one_chr <- "length one vector"
length_one_int <- 4
is.vector(length_one_chr)
is.vector(length_one_int)
length(length_one_chr)
length(length_one_int)
class(favorite_birds)
class(shoe_sizes)
class(digits)
class(is_odd)
second_favorite_bird <- favorite_birds[2]
second_favorite_bird
top_two_birds <- favorite_birds[c(1,2)]
top_two_birds
odd_digits <- digits[is_odd]
odd_digits
tree_data$Species
tree_data$Age_yr
tree_data[, "Species"]  # Same as tree_data$Species
tree_data[, c("Species", "Age_yr")]  # Data frame with only Species and Age_yr columns
tree_data[1:5, "Species"]  # First five elements of Species column
is.vector(tree_data$Species)
class(tree_data$Species)
is.vector(tree_data$Age_yr)
class(tree_data$Age_yr)

str(tree_data)  # Check out the abbreviations next to the column names: chr, num, and int. These are the data types of the columns.
#to calculate the mean of the 'age' column in the original dataframe:
mean(tree_data$Age_yr)

#or to calculate the mean of the DBH vector we created:
dbh <- tree_data$DBH_in
mean(dbh)

#like-wise for the median, standard deviation, and minimum:
median(tree_data$Age_yr)
sd(tree_data$Age_yr)
min(tree_data$Age_yr)
unique(tree_data$Species)  # in this line, we are printing all of the unique species names in the dataset (in this case 2).
species_names <- unique(tree_data$Species)  # assign unique species names to a variable
in_per_ft <- 12  # Since there are 12 inches in a foot
# here we are specifying the new column on the left side of the '<-', and telling R what we want put into it on the right side of the '<-'.
tree_data$Height_in <- tree_data$Height_ft * in_per_ft

# we can now use the 'head function to check our work, and make sure the proper conversion was carried out:
head(tree_data)
hist(x = tree_data$Age_yr)
# this should generate an ugly but effective scatterplot of tree height vs age. It would appear that older trees are taller!
plot(x = tree_data$Age_yr, y = tree_data$Height_ft)
# let's try the same, but with DBH as the dependent variable on the y axis:
# again, we should see a plot below showing that, even in a make-believe forest, older trees of both species tend to be thicker!
plot(x = tree_data$Age_yr, y = tree_data$DBH_in)
?plot
?dplyr::filter
install.packages("janitor")
library('janitor')
library('janitor')
install.packages("tidyverse")
library('tidyverse')
library('janitor')
library('tidyverse')
messy_data <- tibble(TrailName = c('Bayside', 'Coastal', 'Tidepools', 
                                   'Monument', 'Tidepools', NA),
                     Visitor_count = c(200, 300, 400, 500, 400, NA),
                     `empty row` = NA)
messy_data
clean_data <- clean_names(messy_data)
clean_data
remove_empty(
  # first argument is the data (`dat` here is a placeholder name; this call just shows the arguments and is not meant to be run)
  dat, 
  # the next specifies if you want to get rid of empty rows, columns, or both
  which = c("rows", "cols"), 
  # this last one asks whether (TRUE) or not (FALSE) you want R to tell you which rows or columns were removed
  quiet = TRUE)
clean_data2 <- remove_empty(clean_data, which = c('rows'), quiet = FALSE)
clean_data2
clean_data2 <- remove_empty(clean_data, which = c('rows', 'cols'), quiet = TRUE)
clean_data2
clean_data <- clean_names(messy_data)
clean_data <- remove_empty(clean_data, which = c('rows', 'cols'), quiet = TRUE)
clean_data
clean_data <- clean_names(messy_data) %>%
              remove_empty(which = c('rows', 'cols'), quiet = TRUE)
clean_data
# note that replacing %>% with |> also works
clean_data <- clean_names(messy_data) %>%
              remove_empty(which = c('rows', 'cols'), quiet = TRUE) %>%
              distinct()
clean_data
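As a quick aside on the pipe note above: a minimal base-R sketch (toy vector, no packages) showing that the native |> pipe, available in R >= 4.1, chains calls the same way %>% does for simple left-to-right pipelines:

```r
# Base pipe sketch (requires R >= 4.1); the right-hand side must be a function call
x <- c(4, 1, 3, NA)
res <- x |> sort() |> head(2)  # sort() drops NA by default
identical(res, c(1, 3))        # TRUE
```

Note that %>% still offers extras the base pipe lacks in R 4.1 (e.g., the `.` placeholder), so the two are interchangeable only for simple pipelines like this one.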
messy_data2 <- tibble(`Park Code` = c(NA, NA, 'CABR', 'CHIS', 'SAMO', 'SAMO'),
                     visitor_count = c(NA, NA, 400, 500, 400, 400),
                     `empty row` = NA)
clean_data2 <- clean_names(messy_data2) %>%
               remove_empty(which = c('rows', 'cols'), quiet = TRUE) %>%
               distinct()
library('readr')
library('janitor')
library('tidyverse')
park_visits <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/national_parks.csv")
state_pop <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/state_pop.csv")
gas_price <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/gas_price.csv")
# let's see the column names in park_visits
colnames(park_visits)

# we can also look at the first few rows by using head()

head(park_visits)

# you can either select for the columns you want
park_visits1 <- park_visits %>% select(year, parkname, unit_code, state, visitors)

# or against the columns you don't want, by using the minus sign before the column name
park_visits2 <- park_visits %>% select(-gnis_id, -geometry, -metadata, -number_of_records, -region, -unit_name, -unit_type)

# you can check if you have the same result by using colnames()
colnames(park_visits1)
colnames(park_visits2)

# to clear up our environment, we'll get rid of park_visits1 and 2 using remove()
remove(park_visits1, park_visits2)

# since we won't need all of the columns in park_visits, we'll change that file (note that park_visits is on both the left and right sides of the pipe)
park_visits <- park_visits %>% select(year, parkname, unit_code, state, visitors)
# rename parkname column to make it snake case
park_visits <- park_visits %>% rename(park_name = parkname)

# we can check if we've done it correctly by looking at the column names
colnames(park_visits)
wa_park_visits <- filter(
                        # name of data.frame
                        park_visits, 
                        # condition
                        state == 'WA'
                        )

# head() shows the first few rows of data
head(wa_park_visits)

# you can use unique() to check that you've done the filtering correctly
unique(wa_park_visits$state)
wa_park_visits <- park_visits %>%
                  # filter for parks in WA
                  filter(state == 'WA' &
                  # filter for total visitation
                  year == 'Total' & 
                  # filter for over 5 million visitors
                  visitors > 5000000)

# head() shows the first few rows of data
head(wa_park_visits)
pr_park_visits <- park_visits %>%
                  # filter for parks in Puerto Rico
                  filter(state == 'PR' &
                  # filter for total visitation
                  year == 'Total')

# head() shows the first few rows of data
head(pr_park_visits)
# San Juan National Historic Site (Puerto Rico) had > 55 million visitors!

# remove result from environment
remove(pr_park_visits)
wa_park_visits <- park_visits %>%
                  # filter for parks in WA
                  filter(state == 'WA' &
                  # filter for total visitation
                  year == 'Total' & 
                  # filter for over 5 million visitors
                  visitors > 5000000) %>%
                  # arrange by visitation - default is ascending order
                  arrange(visitors)

# look at result (df is short so head() not needed)
wa_park_visits

# to arrange by descending order, we can use desc()
wa_park_visits2 <- wa_park_visits %>%
                  arrange(desc(visitors))

# look at result 
wa_park_visits2

# remove one result
remove(wa_park_visits2)
# two ways to get the number of observations per state

# option 1 - use pull() to isolate a column and table() to get # of observations
state_n <- park_visits %>%
           # pull the state column only
           pull(state) %>%
           # get # of observations by state in table output
           table()

# option 2 - use group_by() to isolate column(s) and tally() to get # of obs
state_n2 <- park_visits %>%
           # group by state
           group_by(state) %>%
           # get tally of observations by state in output
           tally()

# remove created objects
remove(state_n, state_n2)
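To see the difference between the two options side by side, here is a self-contained toy example (made-up data, dplyr assumed loaded): table() returns a base R table object, while tally() returns a tibble you can keep piping on:

```r
library(dplyr)

toy <- tibble(state = c("WA", "WA", "CA"))

# option 1: a named base R table (levels in alphabetical order)
table(toy$state)
# CA WA 
#  1  2 

# option 2: a tibble with columns `state` and `n`
toy %>% group_by(state) %>% tally()
```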
wa_park_visits_millions <- wa_park_visits %>%
                           mutate(visitors_mil = visitors/1000000)

wa_park_visits_millions
park_visits2 <- park_visits %>%
                # remove the rows with Totals per park
                filter(year != "Total")  %>% 
                # do calculations by park
                group_by(park_name) %>%
                # add visit_percent column
                # visits for each year divided by the sum of total visits for each park
                # na.rm = TRUE means that NA values are not included in calculations
                # round(2) rounds the result to 2 decimal places for easier reading
                mutate(visit_percent = (100*visitors/sum(visitors, na.rm = TRUE)) %>%
                         round(2)) 

# take a look at the result
head(park_visits2)

# remove from environment
remove(park_visits2)
state_summary <- park_visits %>%
                # filter for total visitation numbers (to avoid double counts)
                filter(year == 'Total') %>%
                # do calculations by state
                group_by(state) %>%
                # calculate summaries
                summarize(
                  # mean number of total visitors across parks in each state
                  mean_visit = mean(visitors, na.rm = TRUE),
                  # sd of total visitors across parks in each state
                  sd_visit = sd(visitors, na.rm = TRUE),
                  # min total visitors across parks in each state
                  min_visit = min(visitors, na.rm = TRUE),
                  # max total visitors across parks in each state
                  max_visit = max(visitors, na.rm = TRUE),
                  # get number of park units w data in each state
                  n_parks = n()
                )

# take a look at the result
head(state_summary)

# if you want to continue analyzing this summarized data, it's safest to ungroup() first
# for example:
state_summary <- state_summary %>%
                 ungroup() %>%
                 mutate(dif = max_visit - min_visit)

# remove from environment
remove(state_summary)
challenge2 <- park_visits %>%
                # filter for total visitation numbers (to avoid double counts)
                filter(year == 'Total') %>%
                # do calculations by state
                group_by(state) %>%
                # calculate summaries
                summarize(
                  # mean number of total visitors across parks in each state
                  mean_visit = mean(visitors, na.rm = TRUE),
                  # get number of park units w data in each state
                  n_parks = n()
                ) %>%
                # ungroup to do another calculation
                ungroup() %>%
                # calculate visit/n
                mutate(visit_per_park = mean_visit/n_parks) %>%
                # select relevant columns
                select(state, visit_per_park) %>%
                # arrange in descending order
                arrange(desc(visit_per_park))
                

# take a look at the result
head(challenge2)

# remove from environment
remove(challenge2)
# call packages
library("readr")
library("janitor")
library("tidyverse")

# get data
park_visits <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/national_parks.csv")
state_pop <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/state_pop.csv")
gas_price <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/gas_price.csv")

# cut some columns from park_visits data
park_visits <- park_visits %>% select(year, parkname, unit_code, state, visitors)

# we'll first take a subset of the park visits data
medn_visits <- park_visits %>%
  # get parks in MEDN and drop the "Total" rows (years are filtered below)
  filter(unit_code %in% c("CABR", "CHIS", "SAMO") & year != "Total") %>%
  # make year a number, since it no longer contains the text value "Total"
  mutate(year = as.numeric(year)) %>%
  # get years 2010:2015
  filter(year >= 2010 & year <= 2015) %>%
  # arrange in ascending order
  arrange(year) %>%
  # select relevant columns
  select(unit_code, year, visitors)

# if we take a look at the data, it is in long format, each observation is a row
# this isn't the most human-friendly way to look at data, but it is really helpful if we want to use dplyr group_by() functions like mutate() and summarize()
medn_visits

# conversely, we can pivot wider to make the data wide format
# it is more human-friendly, but you can't use dplyr functions on it as easily
medn_visits_wide <- pivot_wider(medn_visits, names_from = year, values_from = visitors)
medn_visits_wide
# check out the original data (medn_visits) in long format
medn_visits

# if we want to make each row a park and each column a year, we can pivot wider
medn_visits_wide <- medn_visits %>%
  pivot_wider(
    # names from is the thing you want to be the columns
    names_from = year,
    # values from is what you want to fill the columns with
    values_from = visitors
  )

# check out the result
medn_visits_wide

# we can also make park units the columns and fill it with visitor data for each year
medn_visits_wide <- medn_visits %>%
  pivot_wider(names_from = unit_code, values_from = visitors)
medn_visits_wide
# check out the original data (medn_visits_wide) in wide format
medn_visits_wide

# if we want to make each row an observation, with an associated park and year, we'll pivot_longer
medn_visits_long <- medn_visits_wide %>%
  pivot_longer(
    # first argument is the columns you'd like to transform
    SAMO:CHIS,
    # next is what you'd like to name the former name cols
    names_to = "unit_code",
    # last is what you'd like to name the former values cols
    values_to = "visitors"
  )

# check out the result
medn_visits_long

# this should look familiar - like the original medn_visits data
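The same wide-then-long round trip can be checked on a tiny toy table (made-up parks and counts, tidyr assumed installed):

```r
library(tidyr)

# Toy long-format data: one row per park-year observation
long <- data.frame(park     = c("A", "A", "B", "B"),
                   year     = c(2010, 2011, 2010, 2011),
                   visitors = c(10, 20, 30, 40))

# long -> wide: one row per park, one column per year (2 rows, 3 columns)
wide <- pivot_wider(long, names_from = year, values_from = visitors)

# wide -> long: back to one row per observation (4 rows; note `year` is now character)
back <- pivot_longer(wide, -park, names_to = "year", values_to = "visitors")
```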
# get 3 states of choice (FL, CA, AK) for years 2010-2015
solution_original <- state_pop %>%
  filter(state %in% c("FL", "AK", "CA") &
    year >= 2010 & year <= 2015)
# pivot wider
solution_wide <- solution_original %>%
  pivot_wider(
    names_from = "year",
    values_from = "pop"
  )

# pivot longer to return to format of solution_original
solution_long <- solution_wide %>%
  pivot_longer(`2010`:`2015`, names_to = "year", values_to = "pop")
# call packages
library("readr")
library("janitor")
library("tidyverse")

# get data
park_visits <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/national_parks.csv")
state_pop <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/state_pop.csv")
gas_price <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/gas_price.csv")

# cut some columns from park_visits data and make sure year contains numeric data only
park_visits <- park_visits %>%
  select(year, parkname, unit_code, state, visitors) %>%
  filter(year != "Total") %>%
  mutate(year = as.integer(year))

# we'll first make two subsets of the park_visits data
SHEN_visits <- park_visits %>%
  # get data from Shenandoah.
  filter(unit_code == "SHEN")

SHEN_visits

ACAD_visits <- park_visits %>%
  # get data from Acadia.
  filter(unit_code == "ACAD")

ACAD_visits

# now we will combine the 2 data.frames into one.
SHEN_ACAD_visits <- bind_rows(SHEN_visits, ACAD_visits)

# check that data from both parks is there
SHEN_ACAD_visits %>%
  pull(unit_code) %>%
  table()

rm(SHEN_ACAD_visits, SHEN_visits, ACAD_visits)

# we'll first make two subsets of the park_visits data
visits_left <- park_visits %>%
  # get first 2 columns.
  select(year, parkname)

visits_left

visits_right <- park_visits %>%
  #  get last 3 columns.
  select(unit_code, state, visitors)

visits_right

visits_bound <- bind_cols(visits_left, visits_right)

visits_bound

rm(visits_left, visits_right, visits_bound)

# Atlantic islands
Atlantic <- park_visits %>%
  filter(state %in% c("PR", "VI"))
# Pacific Islands
Pacific <- park_visits %>%
  filter(state %in% c("AS", "GU", "HI"))

# All islands
Islands <- bind_rows(Atlantic, Pacific)


rm(Atlantic, Pacific, Islands)

# filter the state data
state_pop2 <- state_pop %>% filter(year < 2000)

# left_join example
state_pop_gas1 <- left_join(
  x = state_pop2, # first data.frame
  y = gas_price, # second data.frame
  by = "year" # key column
)

state_pop_gas1

# right_join example
state_pop_gas2 <- right_join(
  x = state_pop2, # first data.frame
  y = gas_price, # second data.frame
  by = "year" # key column
)

state_pop_gas2

# inner_join example 

state_pop_gas3 <- inner_join(
  x = state_pop2, # first data.frame
  y = gas_price, # second data.frame
  by = "year" # key column
)

state_pop_gas3

# full_join example 

state_pop_gas4 <- full_join(
  x = state_pop2, # first data.frame
  y = gas_price, # second data.frame
  by = "year" # key column
)

state_pop_gas4

rm(state_pop_gas1, state_pop_gas2, state_pop_gas3, state_pop_gas4)
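To make the differences between the four mutating joins easy to see, here is a toy example (made-up years and values) where only one key overlaps:

```r
library(dplyr)

a <- tibble(year = c(2000, 2001), pop = c(10, 20))
b <- tibble(year = c(2001, 2002), gas = c(1.5, 1.6))

nrow(left_join(a, b, by = "year"))   # 2: all rows of a kept; gas is NA for 2000
nrow(right_join(a, b, by = "year"))  # 2: all rows of b kept; pop is NA for 2002
nrow(inner_join(a, b, by = "year"))  # 1: only year 2001 appears in both
nrow(full_join(a, b, by = "year"))   # 3: years 2000, 2001, and 2002 all kept
```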


park_visits_pop <- left_join(park_visits, state_pop, by = c("state", "year"))

rm(park_visits_pop)

# semi_join example 

state_pop_gas5 <- semi_join(
  x = state_pop2, # first data.frame
  y = gas_price, # second data.frame
  by = "year" # key column
)

state_pop_gas5

# anti_join example 

state_pop_gas6 <- anti_join(
  x = state_pop2, # first data.frame
  y = gas_price, # second data.frame
  by = "year" # key column
)

state_pop_gas6
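A toy sketch of the two filtering joins (made-up data): unlike the mutating joins above, neither adds columns from y; they only keep or drop rows of x:

```r
library(dplyr)

a <- tibble(year = c(2000, 2001), pop = c(10, 20))
b <- tibble(year = c(2001, 2002), gas = c(1.5, 1.6))

semi_join(a, b, by = "year")  # rows of a WITH a match in b (year 2001); no gas column added
anti_join(a, b, by = "year")  # rows of a WITHOUT a match in b (year 2000)
```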

# clean up
rm(state_pop2, state_pop_gas5, state_pop_gas6) # state_pop_gas1 through state_pop_gas4 were already removed above


## In this case we filter out the NA values inside the semi_join() call. You could also filter them out first, save the result to a new data.frame, and then do the join.

park_visits_filtered <- semi_join(park_visits, state_pop %>% filter(!is.na(pop)), by = c("state", "year")) %>% 
  semi_join(gas_price, by = "year")

park_visits_filtered


rm(park_visits_filtered)
library(tidyverse)
library(knitr)
library(kableExtra)
library(here)
covid_numbers <- readr::read_csv(here::here("Data", "covid_numbers.csv"))
head(covid_numbers, 12) %>%
  knitr::kable(align = "c", caption = "Daily Covid cases and population numbers by state (only showing first 12 records)") %>%
  kableExtra::kable_styling(full_width = F)
acme_in <- read_csv(here::here("Data", "acme_sales.csv")) %>%
  dplyr::arrange(category, product) 
acme_in %>%
  knitr::kable(align = "c", caption = "Average monthly revenue (in $1000's) from Acme product sales, 1950 - 2020") %>%
  kableExtra::kable_styling(full_width = F)
acme <- acme_in %>%
  pivot_longer(-c(category, product), names_to = "month", values_to = "revenue")
acme$month <- factor(acme$month, levels = month.abb)

ggplot(acme, aes(x=month, y=product, fill=revenue)) + 
  geom_raster() +
  geom_text(aes(label=revenue, color = revenue > 1250)) + # color of text conditional on revenue relative to 1250
  scale_color_manual(guide = "none", values = c("black", "white")) + # set color of text
  scale_fill_viridis_c(direction = -1, name = "Monthly revenue,\nin $1000's") +
  scale_y_discrete(limits=rev) + # reverses the y-axis order because ggplot reverses it relative to the data
  labs(title = "Average monthly revenue (in $1000's) from Acme product sales, 1950 - 2020", x = "Month", y = "Product") + 
  theme_bw(base_size = 12) +
  facet_grid(rows = vars(category), scales = "free") # set scales to free so each facet only shows its own levels
ansc <- anscombe %>%
  dplyr::select(x1, y1, x2, y2, x3, y3, x4, y4) 

ansc %>%
  knitr::kable(align = "c", caption = "Anscombe's Quartet - Four bivariate datasets with identical summary statistics") %>%
  kableExtra::column_spec (c(2,4,6),border_left = F, border_right = T) %>%
  kableExtra::kable_styling(full_width = F)

sapply(ansc, function(x) c(mean=round(mean(x), 2), var=round(var(x), 2))) %>%
  knitr::kable(align = "c", caption = "Means and variances are identical in the four datasets. The correlation between x and y (r = 0.82) is also identical across the datasets.") %>%
  kableExtra::column_spec (c(1,3,5,7),border_left = F, border_right = T) %>%
  kableExtra::kable_styling(full_width = F)
# This is a useful chunk of code that installs any missing packages and then loads all of them. Useful if we're sharing code with others who may have older versions of RStudio installed (more recent versions have pop-up prompts that ask if we want to install missing packages)
pkgs <- c("tidyverse", # includes dplyr, ggplot2, and other really useful packages
          "magrittr", # for the `%<>%` function to reassign piping outcomes to the original variable
          "viridis", # popular colorblind-friendly palette
          "scales", # show_col() and scale_x_date()
          "knitr") # format tables for display 
installed_pkgs <- pkgs %in% installed.packages() # check which packages are not yet installed
if (length(pkgs[!installed_pkgs]) > 0) install.packages(pkgs[!installed_pkgs], dep = TRUE) # if some packages are not yet installed, go ahead and install them...
lapply(pkgs, library, character.only = TRUE) # ...then load all the packages and their dependencies
park_visits <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/national_parks.csv") # visits to US National Parks

gas_price <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/gas_price.csv") # gas prices over space and time
# Examine the data to understand data structure, data types, and potential problems
View(park_visits); summary(park_visits)
View(gas_price); summary(gas_price)

# NOTES: 
# 1. When we preface function names with the package name, we avoid potential issues with duplicate function names among packages
# 2. The dplyr::select() function can simultaneously select columns, rename them, and order them as specified. See `?select`

park_sub <- park_visits %>%
  dplyr::filter( 
    region %in% c("IM", "PW", "SE"),
    unit_type %in% c("National Monument", "National Historic Site", "National Park"),
    year %in% 2000:2015) %>% # we would get the same result if we had enclosed the years in quotes, i.e., year %in% "2000":"2015"
  dplyr::select(region, unit_type, unit_name, unit_code, year, visitors) 

# summary(park_sub) # always check the data before and after wrangling!
glimpse(park_sub)
# `gas_constant` is gas price in constant 2015 dollars/gallon
# Confirm that both data frames have a column called `year` and that the data types match (always check data before and after wrangling!)
glimpse(park_sub) # `year` is a character data type
glimpse(gas_price) # `year` is a double data type (double and numeric data types are equivalent) 

# To join the two data frames by `year`, we need `year` to have the same data type in both data frames. Convert park_sub$year to numeric
park_sub$year <- as.numeric(park_sub$year) # can also use as.double()
class(park_sub$year)
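A small aside on this coercion (toy values, not the park data): quoted digits convert cleanly, while non-numeric text like "Total" becomes NA with a warning, which is why we filter those rows out before converting:

```r
# Character-to-numeric coercion behaves differently for digits vs text
as.numeric("2005") + 1          # 2006
as.numeric(c("2005", "Total"))  # 2005 NA, plus a "NAs introduced by coercion" warning
```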

# Confirm that there is one and only one `gas_constant` per year from 2000 to 2015. If there are multiple `gas_constant` values for any year, extra (near-duplicate) rows will be created for that year when we join the data frames 
subset(gas_price, year %in% 2000:2015) # since we're only working with 16 years of data, it's easy enough to check every year

joined_dat <- left_join(park_sub, gas_price[c("year", "gas_constant")], by = "year") # keep all records in park_sub, and add records from gas_price that match by year

dim(park_sub)
dim(joined_dat) # confirm that joined_dat has the same number of rows as park_sub, plus one extra column

summary(joined_dat)
glimpse(joined_dat)
plants <- data.frame(
  plant_id = c(1:4, 1, 2),
  age_days = c(160, 160, 160, 160, 190, 190), 
  height_cm = c(21, 19, 32, 23, 24, 20),
  num_leaves = c(11, 12, 16, 8, 12, 14)
)

twitter_likes <- data.frame(
  political_party = rep(c("Democrat", "Republican", "Independent"), each = 3),
  animal = rep(c("Cat Food Breath", "Doug the Pug", "Jon Pigeon"), times = 3),
  votes = c(5201, 4760, 1200, 6912, 3448, 800, 3722, 2666, 5128)
) %>%
  tidyr::pivot_wider(names_from = animal, values_from = votes)

test_grades <- data.frame(
  student = c("Robin", "Marl", "Randi"),
  before_lunch = c(89, 77, 82),
  after_lunch = c(71, 76, 43)
)
# Plants 
knitr::kable(plants, caption = "A) Plants That Grow", align = "c") %>%
  kableExtra::kable_styling(full_width = F)
# Twitter Likes
knitr::kable(twitter_likes, caption = "B) Political Twitter Likes", align = "c") %>%
  kableExtra::kable_styling(full_width = F)
# Test Grades
knitr::kable(test_grades, caption = "C) Test Grades and Lunch", align = "c") %>%
  kableExtra::kable_styling(full_width = F)

# NOTES:
# 1. The factor data type is similar to (and often interchangeable with) the character data type, but limits entries to pre-specified values and allows us to specify an order to those values (other than the default alphabetical order)
# 2. Factors are especially useful for ordered categories (e.g., low < medium < high) with a small number of possible values
# 3. Under the hood, `R` stores factors as integer vectors representing the level order (with character string labels)
# 4. I usually classify categorical variables as character data types, unless I want to limit entries to a few pre-specified values, or if I want to enforce a particular order to the category levels

unique(joined_dat$region) # check what values are in `region`

joined_dat$region <- 
  factor(joined_dat$region, 
         levels = c("PW", "IM", "SE"), # specifies the factor levels and an order for the levels. If we omit this argument, the factor levels will be the unique set of values in the column, ordered alphabetically
         labels = c("Pacific West", "Intermountain", "Southeast")) # relabels the levels with more descriptive names

# Check the data
glimpse(joined_dat) # `region` is now a factor data type
levels(joined_dat$region) # shows the (renamed) factor levels and their order
str(joined_dat$region) # shows the values are coded as integers indicating factor level order
summary(joined_dat) # the summary for `region` now counts the number of times each factor level occurs
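A minimal base-R sketch of notes 1-3 above (toy category values): the level order we specify drives both the underlying integer codes and the display order:

```r
# Toy ordered categories: enforce low < med < high instead of alphabetical order
f <- factor(c("high", "low", "med"),
            levels = c("low", "med", "high"))

levels(f)      # "low"  "med"  "high"
as.integer(f)  # 3 1 2 -- the integer codes follow the specified level order
table(f)       # counts appear in level order, not alphabetical order
```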

length(2000:2015) # if we can't figure this out in our head, we probably need more sleep. But here is proof that there are 16 years in the range 2000 to 2015, inclusive

# Check if we have multiple records for any combination of unit_code and year (e.g., ARCH should not have multiple records for the year 2015)

# Short example to show how `duplicated` works
test <- data.frame(x = c(1, 1, 3, 2, 2), y = c(7, 2, 4, 1, 1), z = 1:5)
duplicated(test[c("x", "y")])
test[duplicated(test[c("x", "y")]), ]

# Now use `duplicated` on our data
joined_dat[duplicated(joined_dat[c("unit_code", "year")]), ] # good, no duplicates

# Identify park units with less than 16 years of data
remove_units <- joined_dat %>%
  count(unit_code) %>% # count() combines group_by() and summarize(n = n()) in a single function. 
  dplyr::filter(n != 16) # keep only park units that do not have a full 16 years of data

# View(joined_dat) to check results before you remove these park units from the data

# Remove those park units from the data
viz_dat <- joined_dat %>%
  dplyr::filter(!unit_code %in% remove_units$unit_code)
  
# View(viz_dat) to make sure the park units have been removed. The final data frame should have 1856 rows and 7 columns.
# If you want to save `viz_dat` to your current working directory, run the code below. This code saves `viz_dat` as an .RDS file, which just means it saves it as an `R` object instead of as a .csv file. 

# When we save a data frame as an `R` object we retain the accompanying attribute information (e.g., the order of factor levels for `region`)--this information would not be saved in a .csv file.

saveRDS(viz_dat, "viz_dat.RDS")

# To read the RDS file from your working directory and assign it to `viz_dat`, use the code below. You can assign it to any name you want. If you're reading it from a location other than your current working directory, provide the file path in the function argument.
viz_dat <- readRDS("viz_dat.RDS")
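A quick self-contained check of the claim that .RDS files preserve attribute information (toy factor, written to a temporary file):

```r
# Toy factor with a custom (non-alphabetical) level order
f <- factor(c("b", "a"), levels = c("b", "a"))
tmp <- tempfile(fileext = ".RDS")

saveRDS(data.frame(x = f), tmp)  # write the R object to disk
back <- readRDS(tmp)             # read it back

levels(back$x)  # "b" "a" -- the custom level order survives the round trip
file.remove(tmp)
```

A .csv written and re-read from the same data frame would come back as plain character data, losing the level order.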

tab_count <- joined_dat %>% 
  dplyr::distinct(region, unit_type, unit_code) %>%
  dplyr::count(region, unit_type) %>%
  tidyr::pivot_wider(names_from = region, values_from = n)

# View(joined_dat) to double-check the numbers
tab_count
library(tidyverse)

viz_dat <- readRDS("Data/viz_dat.RDS")

# This tab requires the `viz_dat` data frame created earlier in the module. If you don't have this data frame in the global environment, run the code below.

viz_dat <- readRDS("viz_dat.RDS") # this file is provided with the training materials. If this file is not in your current working directory, provide the file path in the function argument

# NOTE: `lm` is part of the `stats` package that is automatically installed with `R`. This code comes from the Examples section of `?lm`.

# Create fake data
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl, trt)

# Simple linear regression of `weight` as a function of group
mod <- lm(weight ~ group) # assign model results to the variable `mod`
summary(mod) # model summary
class(mod) # class of `mod` is `lm`
plot(mod) # residual plots
# Check out `?boxplot`

# Subset the data
subdat <- viz_dat %>%
  dplyr::filter(
    year %in% c(2000, 2005, 2010, 2015),
    unit_type == "National Monument",
    region == "Intermountain")

boxplot(visitors ~ year, data = subdat) # the basic boxplot, with data provided as a formula

# An alternative way to feed the data as a formula
# boxplot(subdat$visitors ~ subdat$year) # no need for `data` argument

# Save this boxplot to `p_box` so we can look at the underlying structure
p_box <- boxplot(visitors/1000 ~ year, # so now, 200,000 will be shown as 200 on the y-axis
        data = subdat,
        main = "National Monuments in the Intermountain Region", # plot title
        xlab = "Year", # x-axis title
        ylab = "Number of visitors, in 1000's") # y-axis title
p_box
range(subdat$visitors[subdat$year==2000]) 

# Subset the data
arch_dat <- subset(viz_dat, unit_code == "ARCH")

# Visitors vs. gas price
plot(visitors/1000 ~ gas_constant, 
     data = arch_dat, 
     pch = 19, # change point shape to a solid circle
     cex = 1.5, # increase point size
     main = "Gas Prices vs Visitation at Arches National Park, 2000 - 2015", 
     xlab = "Gas price (constant 2015 dollars/gallon)", 
     ylab = "Number of visitors, in 1000's")
# Visitors vs. year
plot(visitors/1000 ~ year, data = arch_dat, pch = 19, col = ifelse(arch_dat$year == 2015, "red", "black"), cex = 1.5, main = "Arches National Park Visitation, 2000 - 2015", sub = "(red point is year 2015; all other years are shown in black)", xlab = "Year", ylab = "Number of visitors, in 1000's")
# Gas price vs. year
plot(gas_constant ~ year, data = arch_dat, pch = 19, col = ifelse(arch_dat$year == 2015, "red", "black"), cex = 1.5, main = "Average National Gas Prices, 2000 - 2015", sub = "(red point is year 2015; all other years are shown in black)", xlab = "Year", ylab = "Gas price (constant 2015 dollars/gallon)")

# Line graph
plot(visitors/1000 ~ year, data = arch_dat, type = "o", main = "Arches National Park Visitation, 2000 - 2015", xlab = "Year", ylab = "Number of visitors, in 1000's") # use `type = "o"` to get a line plot with points
library(tidyverse)
library(ggthemes)

viz_dat <- readRDS("Data/viz_dat.RDS")

arch_dat <- subset(viz_dat, unit_code == "ARCH") # subset of data that is for Arches National Park

# This tab requires the `viz_dat` and `arch_dat` data frames created earlier in the module. If you don't have these data frames in the global environment, run the code below.

viz_dat <- readRDS("viz_dat.RDS") # this file is provided with the training materials. If this file is not in your current working directory, provide the file path in the function argument

arch_dat <- subset(viz_dat, unit_code == "ARCH") # subset of data that is for Arches National Park

# NOTES:
# 1. The first argument provided to `ggplot()` is assumed to be the data frame, so it's not necessary to include the argument name `data =`.
# 2. We are assigning the plot to a variable `p` so we can build on this `ggplot()` master template in the next step.
# 3. The parentheses around the line of code are a shortcut for `print()`. Without them, the plot would be assigned to `p` but not printed in the plots pane.

(p <- ggplot(data = arch_dat, aes(x = year, y = visitors/1000)))

summary(p) # summary of the information contained in the plot object

p$data # the plot object is self-contained. The underlying data frame is saved as a list element in the plot object. 
# NOTE: We combine `ggplot()` layers with `+`, not with `%>%`. With a `ggplot()` object we are not piping sequential results but rather layering pieces. In this example we are building our `ggplot()` object in a particular order, but we could actually reorder the pieces in any way (as long as the master template is set first). The only difference would be that if different layers have conflicting information, information in later layers overrides information in earlier layers.

p + geom_line()
# Example of setting aesthetic mappings within the layer only
ggplot(data = arch_dat) +
  geom_line(aes(x = year, y = visitors/1000))
p + geom_line() + geom_point()
# Example of setting aesthetic mappings within the layer only
ggplot(data = arch_dat) +
  geom_line(aes(x = year, y = visitors/1000)) +
  geom_point()
p + layer(geom = "line", stat = "identity", position = "identity") # this creates the same line graph we had created using `geom_line()`
# NOTES:
# 1. Our plot code is getting long, so we will separate out the components to improve readability
# 2. We will assign this plot object to a new variable, `p2` so we can just add to `p2` in the next step instead of re-typing all these lines
# 3. `ggplot()` also provides shortcuts for some scaling arguments. For example, if we just want to set y-axis limits we can add the layer `ylim (600, 1500)` instead of setting limits via `scale_y_continuous()`

# Check out `?scale_y_continuous` for ways to modify the y-axis
(p2 <- p + 
  geom_line() + 
  geom_point() +
  scale_y_continuous(
    name = "Number of visitors, in 1000's", # y-axis title
    limits = c(600, 1500), # minimum and maximum values for y-axis
    breaks = seq(600, 1500, by = 100)) # label the axis at 600, 700, 800... up to 1500
)

(p3 <- p2 + 
  labs(
    title = "Arches National Park Visitation, 2000 - 2015", # plot title
    subtitle = "Visitor counts have increased each year since 2004", # plot subtitle
    x = "Year") # x-axis title. We had already set the y-axis title with `scale_y_continuous()`.
 )
p3 + 
  theme_minimal() # try out different themes! theme_classic(), theme_void(), theme_linedraw()... 
p3 + 
  ggthemes::theme_fivethirtyeight()

p3 + 
  ggthemes::theme_fivethirtyeight() +
  theme(
    plot.subtitle = element_text(color = "red", size = 14, hjust=0.5)) # change plot subtitle font color and size, and center the subtitle
# NOTES:
# 1. For this plot we need to use the `viz_dat` dataset so we can facet by park unit name
# 2. In the `facet_wrap` arguments, `ncol = 1` sets a single column so plots are stacked on top of
# each other, and `scales = "free_y"` allows the y-axis range to differ across the subplots. This is
# useful when the range of y-axis values differs greatly across subplots and we are more interested
# in comparing trends than in comparing absolute numbers among the subplots.

subdat <- subset(viz_dat, unit_code %in% c("ARCH", "GRCA", "YOSE")) # plot data for Arches National Park (ARCH), Grand Canyon National Park (GRCA), and Yosemite National Park (YOSE)

ggplot(subdat, aes(x = year, y = visitors/1000)) + 
  geom_line() + 
  geom_point() +
  labs(
    title = "National Park Visitation, 2000 - 2015",
    x = "Year", 
    y = "Number of visitors, in 1000's") +
  theme_bw() +
  theme(plot.title = element_text(hjust=0.5)) + # center the plot title
  facet_wrap(vars(unit_name), ncol = 1, scales = "free_y")
#------------------------
#        Day 3- GIS
#------------------------
#install.packages(c("sf", "spsurvey"))
library(dplyr) # for filter, select, mutate and %>%  
library(sf)
library(tmap)
library(spsurvey)

# Generate a random number for the seed
sample(1:100000, 1) #62051
set.seed(62051)

# Read in shapefiles from teams data folder
sara_bound1 <- st_read("./data/SARA_boundary_4269.shp")
sara_veg1 <- st_read("./data/SARA_Veg.shp")

# Check that the projections match; fix the one that doesn't match
st_crs(sara_bound1) == st_crs(sara_veg1) # FALSE- projections don't match.
# sara_bound1 needs to be reprojected to UTM Zone 18N NAD83. 
sara_bound <- st_transform(sara_bound1, crs = 26918)
st_crs(sara_bound) == st_crs(sara_veg1) # TRUE

# Quick plot
plot(sara_bound[1])
plot(sara_veg1[1]) # bigger extent than boundary

# Intersect boundary and veg so they have the same extent
sara_veg <- st_intersection(sara_veg1, sara_bound)
plot(sara_veg[1])
# View attribute table of layers
head(sara_bound) # 1 feature with 95 fields

str(sara_veg)
head(sara_veg)
names(sara_veg)
table(sara_veg$ANDERSONII)

# Simplify vegetation types for easier plotting
dev <- c("1. Urban or Built-up Land", "11. Residential", 
         "12. Commercial and Services", "13. Industrial",
         "14. Transportation, Communications, and Utilities", 
         "17. Other Urban or Built-up Land")
crop <- c("21. Cropland and Pasture", 
          "22. Orchards, Groves, Vineyards, and Nurseries", 
          "31. Herbaceous Rangeland")
shrubland <- c("32. Shrub and Brush Rangeland")
forest_up <- c("41. Deciduous Forest Land", "42. Evergreen Forest Land", 
               "43. Mixed Forest Land")
forest_wet <- c("61. Forested Wetland")
open_wet <- c("62. Nonforested wetland", "62. Nonforested Wetland")
water <- c("5. Water", "51. Streams and Canals", "53. Reservoirs")
unk <- "Unclassified"

# Create 2 fields in the veg attribute table: simp_veg and fill_col
sara_veg <- sara_veg %>% 
  mutate(simp_veg = case_when(ANDERSONII %in% dev ~ "Developed",
                              ANDERSONII %in% crop ~ "Open field",
                              ANDERSONII %in% shrubland ~ "Shrublands",
                              ANDERSONII %in% forest_up ~ "Forest",
                              ANDERSONII %in% forest_wet ~ "Forested wetland",
                              ANDERSONII %in% open_wet ~ "Open wetland",
                              ANDERSONII %in% water ~ "Open water",
                              ANDERSONII %in% unk ~ "Unclassified",
                              TRUE ~ "Unknown"),
         fill_col = case_when(simp_veg == "Developed" ~ "#D8D8D8",
                              simp_veg == "Open field" ~ "#f5f0b0",
                              simp_veg == "Shrublands" ~ "#F29839",
                              simp_veg == "Powerline ROW" ~ "#F9421D", # note: this class is never assigned by the simp_veg case_when above
                              simp_veg == "Forest" ~ "#55785e",
                              simp_veg == "Forested wetland" ~ "#9577a6",
                              simp_veg == "Open wetland" ~ "#c497d4",
                              simp_veg == "Open water" ~ "#AFD0F2",
                              simp_veg == "Unclassified" ~ "#ffffff"))

library(dplyr)
# Import data
wetdat <- read.csv(
  "https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Advanced/main/data/ACAD_wetland_data_clean.csv")

# Summarize so that each site x year combination has 1 row
wetsum <- wetdat %>% group_by(Site_Name, Year, X_Coord, Y_Coord) %>% 
  summarize(num_inv = sum(Invasive), num_prot = sum(Protected), 
            .groups = 'drop')
# Check first 6 rows of output
head(wetsum)

# Convert dataframe to sf
wetsum_sf <- st_as_sf(wetsum, coords = c("X_Coord", "Y_Coord"), crs = 26919) 
  # ACAD is UTM Zone 19N NAD83, hence the different EPSG code from SARA, which is in Zone 18N.

# Write sf to disk as shapefile
st_write(wetsum_sf, "ACAD_wetland_data.shp")
sara_grts <- grts(sara_bound, n_base = 100) # generate 100 points within SARA boundary
sara_grts$sites_base$priority <- as.numeric(row.names(sara_grts$sites_base)) # add priority number (same as row.name)
# Spatial join
grts_veg <- st_join(sara_grts$sites_base, sara_veg %>% select(simp_veg))
names(grts_veg)

# Create list of forest habitats
sort(unique(sara_veg$simp_veg))
forest_veg <- c("Forest", "Forested wetland") # values must match simp_veg exactly, including case

# Filter out non-forest points
grts_forest <- grts_veg %>% filter(simp_veg %in% forest_veg)
nrow(grts_veg) # 100 points
nrow(grts_forest) # fewer points
# Creating list of simp_veg types and their fill colors for easier legend
for_legend <- unique(data.frame(simp_veg = sara_veg$simp_veg, fill_col = sara_veg$fill_col)) 

sara_map <- 
  # Vegetation map
  tm_shape(sara_veg, projection = 26918, bbox = sara_bound) +
  tm_fill("fill_col") +
  tm_add_legend(type = 'fill', labels = for_legend$simp_veg, 
                col = for_legend$fill_col, z = 3) +

  # Park boundary
  tm_shape(sara_bound) +
  tm_borders('black', lwd = 2) +
  tm_add_legend(type = 'line', labels = "Park Boundary", col = "black",
                z = 2)+
    
  # GRTS points  
  tm_shape(grts_forest) +
  tm_symbols(shape = 21, col = "#EAFF16", border.lwd = 0.5, size = 0.3) + 
  tm_text("priority", size = 0.9, xmod = -0.4, ymod = -0.4) +
  tm_add_legend(type = 'symbol', labels = "GRTS points", shape = 21, 
                col = "#EAFF16", border.lwd = 1, size = 0.5, 
                z = 1) +
  
  # Other map features
  tm_compass(size = 2, type = 'arrow', 
             text.size = 1, position = c('left', 'bottom')) +
  tm_scale_bar(text.size = 1.25, position = c("center", "bottom")) + 
  tm_layout(inner.margins = c(0.2, 0.02, 0.02, 0.02), # make room for legend
            outer.margins = 0,
            legend.text.size = 1.25,
            legend.just = 'right',
            legend.position = c("right", "bottom"),
            title = "Saratoga NHP GRTS points",
            title.position = c("center", "top"))

tmap_save(sara_map, "SARA_GRTS.png", height = 10.5, width = 8, 
          units = 'in', dpi = 600, outer.margins = 0)
sara_map
knitr::include_graphics("SARA_GRTS.png")
tmap_mode("view") # turn interactive mode on
sara_map
tmap_mode("plot") # return to default plot view
# Load library
library(leaflet)

# Load park tiles
NPSbasic = "https://atlas-stg.geoplatform.gov/styles/v1/atlas-user/ck58pyquo009v01p99xebegr9/tiles/256/{z}/{x}/{y}@2x?access_token=pk.eyJ1IjoiYXRsYXMtdXNlciIsImEiOiJjazFmdGx2bjQwMDAwMG5wZmYwbmJwbmE2In0.lWXK2UexpXuyVitesLdwUg"
  
NPSimagery = "https://atlas-stg.geoplatform.gov/styles/v1/atlas-user/ck72fwp2642dv07o7tbqinvz4/tiles/256/{z}/{x}/{y}@2x?access_token=pk.eyJ1IjoiYXRsYXMtdXNlciIsImEiOiJjazFmdGx2bjQwMDAwMG5wZmYwbmJwbmE2In0.lWXK2UexpXuyVitesLdwUg"
  
NPSslate = "https://atlas-stg.geoplatform.gov/styles/v1/atlas-user/ck5cpvc2e0avf01p9zaw4co8o/tiles/256/{z}/{x}/{y}@2x?access_token=pk.eyJ1IjoiYXRsYXMtdXNlciIsImEiOiJjazFmdGx2bjQwMDAwMG5wZmYwbmJwbmE2In0.lWXK2UexpXuyVitesLdwUg"
  
NPSlight = "https://atlas-stg.geoplatform.gov/styles/v1/atlas-user/ck5cpia2u0auf01p9vbugvcpv/tiles/256/{z}/{x}/{y}@2x?access_token=pk.eyJ1IjoiYXRsYXMtdXNlciIsImEiOiJjazFmdGx2bjQwMDAwMG5wZmYwbmJwbmE2In0.lWXK2UexpXuyVitesLdwUg"

# Reproject SARA layers
sara_bnd_wgs <- st_transform(sara_bound, crs = 4326)
grts_forest_wgs <- st_transform(grts_forest, crs = 4326)

# Calculate centroid for leaflet map view
sara_cent <- st_centroid(sara_bnd_wgs)$geometry
str(sara_cent) # -73.63459, 42.99543
sara_map_lf <-
  leaflet() %>% 
  setView(lng = -73.63, lat = 42.99, zoom = 13) %>% 
  # parktiles
  addTiles(group = "Map",
           urlTemplate = NPSbasic) %>%
  addTiles(group = "Imagery",
           urlTemplate = NPSimagery) %>%
  addTiles(group = "Light",
           urlTemplate = NPSlight) %>%
  addTiles(group = "Slate",
           urlTemplate = NPSslate) %>% 
  addLayersControl(
    baseGroups = c("Map", "Imagery", "Light", "Slate"),
    options = layersControlOptions(collapsed = T)) %>% 
  # points
  addCircleMarkers(data = grts_forest_wgs,
                   lng = st_coordinates(grts_forest_wgs)[,1],
                   lat = st_coordinates(grts_forest_wgs)[,2],
                   label = grts_forest_wgs$priority,
                   labelOptions = labelOptions(noHide=T, textOnly = TRUE, 
                      direction = 'bottom', textsize = "12px"),
                   fillColor = "#EAFF16",
                   radius = 4,
                   stroke = FALSE, # turn off outline
                   fillOpacity = 1) %>% 
  # scale bar and settings
  addScaleBar(position = 'bottomright',
              options = scaleBarOptions(maxWidth = 10, metric = TRUE))

sara_map_lf

# Retrieve the latest water quality data from the full dataset
pH_2020 <- dplyr::filter(pH_Data_All, WaterYear == 2020)
DissolvedOxygen_2020 <- dplyr::filter(DO_Data_All, WaterYear == 2020)
SpecificConductance_2020 <- dplyr::filter(SpCond_Data_All, WaterYear == 2020)
current_water_year <- 2020

pH_Current <- dplyr::filter(pH_Data_All, WaterYear == current_water_year)
DissolvedOxygen_Current <- dplyr::filter(DO_Data_All, WaterYear == current_water_year)
SpecificConductance_Current <- dplyr::filter(SpCond_Data_All, WaterYear == current_water_year)
# Numeric vectors
numbers <- 1:10
other_numbers <- 6:15

# Character vectors
code_lang <- c("R", "C++", "Java", "Python", "C")
human_lang <- c("English", "Spanish", "Chinese", "Russian")
new_numbers <- c()  # Create a new NULL vector

new_numbers[1] <- 2 * numbers[1]
new_numbers[2] <- 2 * numbers[2]
new_numbers[3] <- 2 * numbers[3]
new_numbers[4] <- 2 * numbers[4]
new_numbers[5] <- 2 * numbers[5]
new_numbers[6] <- 2 * numbers[6]
new_numbers[7] <- 2 * numbers[7]
new_numbers[8] <- 2 * numbers[8]
new_numbers[9] <- 2 * numbers[9]
new_numbers[10] <- 2 * numbers[10]

new_numbers
new_numbers <- numbers * 2
new_numbers
numbers
other_numbers

other_numbers - numbers
other_numbers * numbers
ones <- rep(1, 6)  # Vector of 1's, length 6
ones

add_this <- c(1, 2)
this_one_breaks <- c(1, 2, 3, 4)

ones + add_this  # This works fine
ones + this_one_breaks  # This generates a warning
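To make the recycling rule concrete, here is a sketch of what the working case above actually returns (toy vectors only):

```r
ones <- rep(1, 6)          # 1 1 1 1 1 1
recycled <- ones + c(1, 2) # c(1, 2) repeats to c(1, 2, 1, 2, 1, 2) before adding
recycled                   # 2 3 2 3 2 3
```

Recycling is silent when the longer length is a multiple of the shorter one, which is why mismatched lengths like `ones + c(1, 2, 3, 4)` only warn rather than error.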
messy_data <- tibble::tibble(plot_id = c("1 ", " 2 ", " ", " 4", "5."),
                             year_monitored = c(2019, 2019, 2019, 2020, 2020))
messy_data
clean_data <- messy_data %>% dplyr::mutate(plot_id = trimws(plot_id),  # Get rid of leading and trailing spaces
                                           plot_id = gsub(pattern = ".", replacement = "", x = plot_id, fixed = TRUE)) %>%  # Replace the stray "." with nothing
  dplyr::filter(plot_id != "") %>%
  dplyr::mutate(plot_id = as.integer(plot_id))
clean_data
plot_id_vec <- messy_data$plot_id
plot_id_vec
plot_id_vec <- trimws(plot_id_vec)
plot_id_vec
plot_id_vec <- gsub(pattern = ".", replacement = "", x = plot_id_vec, fixed = TRUE)
plot_id_vec
id_is_valid <- plot_id_vec != ""
id_is_valid
plot_id_vec <- plot_id_vec[id_is_valid]
plot_id_vec
plot_id_vec <- as.integer(plot_id_vec)
d <- c(1:4, 100*1:4)
d
q_logic <- d < (mean(d) / 10)
q_logic
ifelse(q_logic, yes = "verify", no = "OK") # yes = value returned when TRUE; no = value returned when FALSE
q <- ifelse(d < (mean(d) / 10), "verify", "OK")
d <- c(1:4, 100*1:4, NA)
ifelse(d < mean(d, na.rm = T) / 10, "verify", "OK")
 
ifelse(is.na(d), "Missing", # is.na() already returns TRUE/FALSE, so no `== T` needed
  ifelse(d < mean(d, na.rm = T) / 10, "verify", "OK")
)
# This version errors at the top level: R finishes the `if` at the closing brace,
# so a standalone `else` on the next line is a syntax error
if(...){thing to do if TRUE}
      else{
      thing to do if FALSE
      }
# This version works: `else` sits on the same line as the closing brace
if(...){thing to do if TRUE
           }else{
           thing to do if FALSE
           }
d <- c(1:4, 1:4 * 100, NA)

if (all(is.finite(d))) {
  "no NA, NaN, or Inf present"
} else {
  "NA present"
}
d <- c(1:4, 1:4 * 100)

if (!all(is.finite(d))) {
  "NA present"
} else if (all(d < 100)) {
  "all items less than 100"
} else {
  paste(sum(d > 100), "items more than 100")
}
for(i in vector){ 
  do thing 1
  do thing 2
  ...
  return output 
}
vec <- 1:9
for (i in vec) { # for the numbers 1 through 9
  i # nothing prints: auto-printing is turned off inside loops
}
i # i keeps its final value (9)
for (i in vec) { # for the numbers 1 through 9
  print(i + 1) # use print() to display output from inside a loop
}
#alternatively predefine a container to receive the data
OutDat <- NULL # define an empty variable. This is naughty, see discussion on 'growing' dataframes later on
for (i in 1:9) {
  OutDat[i] <- i + 1 # store that in the ith position of the variable OutDat
}
OutDat
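Since the comment above flags "growing" an object as naughty, here is a sketch of the standard fix: preallocate the result at full length, then fill it.

```r
n <- 9
OutDat2 <- numeric(n)   # preallocate a length-9 numeric vector up front
for (i in seq_len(n)) {
  OutDat2[i] <- i + 1   # fill the ith slot; no reallocation on each pass
}
OutDat2
```

For small loops the difference is negligible, but growing a vector or data frame inside a long loop forces repeated copying and can be very slow.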
library(magrittr)
library(ggplot2)

#create a list of file names that we will read in
fNames <- paste0(
  "https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Intro/master/Data/",
  c(
    "APIS01_20548905_2021_temp.csv",
    "APIS02_20549198_2021_temp.csv",
    "APIS03_20557246_2021_temp.csv",
    "APIS04_20597702_2021_temp.csv",
    "APIS05_20597703_2021_temp.csv"
  )
)
fNames
#read in and inspect the first element (for training purposes)
temp <- read.csv(fNames[1], skip = 1, header = T) # hobo data has an extra line at the top we need to skip
head(temp)
temp <- read.csv(fNames[1], skip = 2, header = F)
names(temp)[1:3] <- c("idx", "DateTime", "T_F") # assign some quick column names
head(temp)
str(temp) # look at data types
rm(temp) # do a little cleanup
len <- length(fNames) # calculate the number of iterations

SummaryData <- list() #Generate a list that we will put output into. Always rerun prior to loop.

for (i in 1:len) {
  # 1 Read in and generate the data to use
  # 1a extract the filename from a filepath and parse the metadata
  fName <- basename(fNames[i])
  Site <- stringr::str_sub(fName, 1, 6)
  Year <- as.numeric(stringr::str_sub(fName, -13, -10))
  # 1b read in the data, change headers, set time so we can do some math later
  d <- read.csv(fNames[i], skip = 1, header = T)[1:3] %>%
    setNames(., c("idx", "DateTime", "T_F")) %>%
    dplyr::mutate(., DateTime2 = as.POSIXct(DateTime, "%m/%d/%y %H:%M:%S", tz = "UTC"))
  # 2 summarize and 3 output the data
  SummaryData[[i]] <- data.frame( #output the data
    # 2 what to summarize for all files 
    Site = Site,
    Year = Year,
    FileName = fName,
    # Variable output
    # 2a What to summarize so long as there is 1 non-NA record
    if (any(!is.na(d$T_F))) { # check that there is at least 1 non-NA record
      list(
        T_F_mean = mean(d$T_F, na.rm = T),
        T_F_min = min(d$T_F, na.rm = T),
        T_F_max = max(d$T_F, na.rm = T),
        nObs = nrow(d),
        nDays = (d$DateTime2[nrow(d)] - d$DateTime2[1])[[1]], # date math works because DateTime2 is POSIXct
        Flag = NA
      ) 
      # 2b what to summarize if the data are missing.
    } else {
      list(
        T_F_mean = NA,
        T_F_min = NA,
        T_F_max = NA,
        nObs = nrow(d),
        nDays = NA,
        Flag = "Hobo File Present But No Data"
      )
    }
  )
}
dplyr::bind_rows(SummaryData)

OutPath <- paste0(getwd(), "/hobo_outputs/") # or put your preferred path here (we will use this later)
# set up some QA thresholds
TempUpper <- 40
TempLower <- (-40)

for (i in fNames) {
  #1 Read in/generate data
  fName <- basename(i)
  d <- read.csv(i, skip = 1, header = T)[1:3] %>%
    setNames(., c("idx", "DateTime", "T_F"))
  
  #2 Generate your output in this case, some QA/QC flagging
  d_com <- d %>%
    dplyr::mutate(
      # checks for too big of a temp change
      TChangeFlag = ifelse(c(abs(diff(T_F, lag = 1)) > 10, FALSE), "D10", NA),
      # flag if a temp is too high or too low
      Flag = dplyr::case_when(
        is.na(T_F) ~ "MISSING",
        T_F > TempUpper ~ "High",
        T_F < TempLower ~ "Low"
      )
    ) %>%
    # merge so you can have multiple flags applied
    tidyr::unite("Flag", c(TChangeFlag, Flag), sep = "_", remove = TRUE, na.rm = TRUE)

  #3 output the data
  dir.create(OutPath, showWarnings = F)
  write.csv(d_com[, c("DateTime", "T_F", "Flag")],
    paste0(OutPath, gsub(".csv", "_QC.csv", fName, fixed = TRUE)),
    row.names = F, quote = F
  )
}
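The `c(abs(diff(T_F, lag = 1)) > 10, FALSE)` expression in the loop above deserves a closer look: `diff()` returns one fewer element than its input, so the result is padded with a trailing `FALSE` to realign it with the original rows. A toy illustration:

```r
temps <- c(60, 62, 80, 81)                # toy temperature series
abs(diff(temps, lag = 1))                 # 2 18 1 -- one element shorter than temps
flags <- c(abs(diff(temps, lag = 1)) > 10, FALSE) # pad so lengths match
flags                                     # FALSE TRUE FALSE FALSE
```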
OutPath <- paste0(getwd(), "/hobo_outputs/") # or put your preferred path here (we will use this later)

for (i in fNames) {
  #1 Read in/generate data
  fName <- basename(i)
  d <- read.csv(i, skip = 1, header = T)[1:3] %>%
    setNames(., c("idx", "DateTime", "T_F")) %>%
    dplyr::mutate(., DateTime2 = as.POSIXct(DateTime, "%m/%d/%y %H:%M:%S", tz = "UTC"))
  #2 Do the thing you want to do, in this case make and save plots
  p <- ggplot(d, aes(x = DateTime2, y = T_F)) +
    geom_point(size = 0.5) +
    ggtitle(fName)
  #3 Output the results outside of the for loop
  dir.create(OutPath, showWarnings = F)
  ggsave(
    filename = gsub(".csv", "_plot.pdf", fName, fixed = TRUE), 
    path = OutPath, device = "pdf",
    plot = p, width = 5, height = 5, units = "in"
  )
}
str(mtcars)
head(mtcars)
apply(mtcars, MARGIN = 1, FUN = mean) # MARGIN = 1: mean of each row

apply(mtcars, MARGIN = 2, FUN = mean) # MARGIN = 2: mean of each column
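The `MARGIN` argument is easiest to see on a tiny matrix (toy example):

```r
m <- matrix(1:6, nrow = 2)        # 2 rows x 3 columns, filled column-wise
m
apply(m, MARGIN = 1, FUN = mean)  # row means: 3 4
apply(m, MARGIN = 2, FUN = mean)  # column means: 1.5 3.5 5.5
```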

fNames
HoboList <- lapply(fNames, read.csv, skip = 1, header = T)
HoboList
HoboList <- lapply(HoboList, "[", , 1:3) # keep only the first 3 columns
HoboList
HoboList <- lapply(HoboList, setNames, c("idx", "DateTime", "T_F"))
HoboList
HoboList <- lapply(fNames, read.csv, skip = 1, header = T) %>% #1. read hobo data into a list
  lapply(., "[", , 1:3) %>% #2. Grab only first 3 columns
  lapply(., setNames, c("idx", "DateTime", "T_F")) #3. fix column names
HoboSummary <- lapply(HoboList[-3], FUN = function(x){ # [-3] drops the blank file for now
  d <- data.frame(
    Mean = mean(x$T_F, na.rm=T),
    Min = min(x$T_F, na.rm=T),
    Max = max(x$T_F, na.rm=T),
    nObs = nrow(x)
  )
})
do.call(rbind, HoboSummary) #equivalent to dplyr::bind_rows(HoboSummary)
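`do.call()` merits a small standalone demo: it calls a function with the elements of a list as its arguments, so `do.call(rbind, HoboSummary)` is equivalent to `rbind(HoboSummary[[1]], HoboSummary[[2]], ...)`. A toy version (hypothetical data):

```r
pieces <- list(data.frame(site = "A", n = 1),
               data.frame(site = "B", n = 2))
stacked <- do.call(rbind, pieces) # same as rbind(pieces[[1]], pieces[[2]])
stacked
```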
guess <- 0  # Initialize the guess variable
number <- sample(1:10, 1)  # Get our random number
guess_count <- 0  # Number of guesses so far

while (guess != number && guess_count < 3) {  # Keep looping if the guess is wrong and we haven't hit our 3 guesses yet
  guess_count <- guess_count + 1
  message <- paste0("Guess ", guess_count, ": what is the random number?  ")
  guess <- as.integer(readline(prompt = message))
  
  if (guess == number) {
    print("Woohoo! You won the game!")
  } else if (guess_count < 3) {
    print("Nope, try again.")
  } else {
    print("You lost :(")
  }
}


library(magrittr)
library(ggplot2)
fNames <- paste0(
  "https://raw.githubusercontent.com/KateMMiller/IMD_R_Training_Intro/master/Data/",
  c(
    "APIS01_20548905_2021_temp.csv",
    "APIS02_20549198_2021_temp.csv",
    "APIS03_20557246_2021_temp.csv",
    "APIS04_20597702_2021_temp.csv",
    "APIS05_20597703_2021_temp.csv"
  )
)

OutPath <- choose.dir() # note, no trailing '\\' in path
TempUpper <- 40
TempLower <- (-40)

SummaryData <- NULL # make sure this gets cleared before rerunning this code
for (i in fNames) {
  d <- read.csv(i, skip = 1, header = T)[1:3]
  fName <- basename(i)
  print(fName) # dev/debug
  names(d) <- c("idx", "DateTime", "T_F")
  d$DateTime2 <- as.POSIXct(d$DateTime, "%m/%d/%y %H:%M:%S", tz = "UTC")
  # summarize the data
  OutDat <- data.frame(
    FileName = fName,
    T_F_mean = mean(d$T_F, na.rm = T),
    T_F_min = min(d$T_F, na.rm = T),
    T_F_max = max(d$T_F, na.rm = T),
    nObs = nrow(d),
    nDays = (d$DateTime2[nrow(d)] - d$DateTime2[1])[[1]]
  ) # days
  # Append new summary data to old summary data
  SummaryData <- rbind(SummaryData, OutDat)
}
#effectively these are manual "breakpoints"

i <- fNames[3] # code dev/debugging fragment
# rm(i) # debug/dev fragment
SummaryData <- NULL # make sure this gets cleared before rerunning this code
# for(i in fNames){
d <- read.csv(i, skip = 1, header = T)[1:3]
d # check output
fName <- basename(i)
# print(fName)#dev/debug frag
names(d) <- c("idx", "DateTime", "T_F")
d$DateTime2 <- as.POSIXct(d$DateTime, "%m/%d/%y %H:%M:%S", tz = "UTC")
# summarize the data
OutDat <- data.frame(
  FileName = fName,
  T_F_mean = mean(d$T_F, na.rm = T),
  T_F_min = min(d$T_F, na.rm = T),
  T_F_max = max(d$T_F, na.rm = T),
  nObs = nrow(d),
  nDays = (d$DateTime2[nrow(d)] - d$DateTime2[1])[[1]]
) # days
# Append new summary data to old summary data
SummaryData <- rbind(SummaryData, OutDat)
#  }

Meet the Instructors

We are the people who designed and led the training for this week. We hope you found it useful, and that you keep at it!

Andrew Birch
WRD/IMD Water Quality Program Lead



Ellen Cheng
SER Quantitative Ecologist



Kate Miller
NETN/MIDN Quantitative Ecologist



Lauren Pandori
CABR Marine Biologist - MEDN



Thomas Parr
GLKN Program Manager



John Paul Schmit
NCRN Quantitative Ecologist



Sarah Wright
MOJN Data Scientist